2024-01-09

Boosting LLM Credibility with Inline Citations: A Guide

Large Language Models (LLMs) are remarkably intelligent, but as we all know, they are limited by their training data.

Let's say we want to answer the question "Why is Caco's domain cacoos.com?". This is what the interaction would look like:

const messages = [
	{
		role: "user",
		text: "Why is Caco's domain cacoos.com?",
	},
];

runSomeBeautifulModel(messages);

You

Why is Caco's domain cacoos.com?

SomeAI

Caco's domain is cacoos.com likely because it is a unique branding choice for their online presence, combining the name "Caco" with a playful twist on the word "OS" (operating system) or a phonetic play on "cocoa," depending on the context of the business or service.

Mmm

Nope! That's not the reason. But building my own OS could be interesting 🤔

To overcome this knowledge limitation, several approaches have been suggested for integrating external knowledge into LLMs. The most well-known is Retrieval-Augmented Generation (RAG), which involves fetching up-to-date or context-specific data from an external source and incorporating it into an LLM's response generation [1].

Let's apply some external knowledge using a basic RAG approach. I'll add three pieces of knowledge into the model's context window, assuming they are the outcomes of a similarity search.
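
The similarity search itself is out of scope for this post, but as a rough sketch, assuming each piece of knowledge already carries a hypothetical `embedding` vector, the retrieval step could look like:

```javascript
// Hypothetical sketch: rank knowledge pieces by cosine similarity to the query.
function cosineSimilarity(a, b) {
	let dot = 0, normA = 0, normB = 0;
	for (let i = 0; i < a.length; i++) {
		dot += a[i] * b[i];
		normA += a[i] * a[i];
		normB += b[i] * b[i];
	}
	return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k pieces most similar to the (precomputed) query embedding.
function topK(queryEmbedding, pieces, k) {
	return [...pieces]
		.sort((p1, p2) =>
			cosineSimilarity(queryEmbedding, p2.embedding) -
			cosineSimilarity(queryEmbedding, p1.embedding))
		.slice(0, k);
}
```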

const piecesOfKnowledge = [
	{
		source: "https://some.super.reliable.source.com/caco-os-explained",
		text: "Caco loves apples and oranges. But he loves oranges more than apples. And this has nothing to do with his domain.",
	},
	{
		source: "https://en.wikipedia.org/wiki/CacoOS",
		text: "Caco's full name is Joaquín Ossandón Stanke.",
	},
	{
		source: "https://why-caco-is-named-cacoos-everywhere.org",
		text: "Caco's nickname, cacoos, is derived from the initials of his last names.",
	},
];

const messages = [
	{
		role: "user",
		text: "Why is Caco's domain cacoos.com?",
	},
	{
		role: "bot",
		text: `Pieces of knowledge:\n\n${piecesOfKnowledge.map(({ text }) => text).join('\n')}`,
	},
];

runSomeBeautifulModel(messages);

You

Why is Caco's domain cacoos.com?

SomeAI

Caco's domain is cacoos.com because it's a play on his nickname "cacoos", derived from the initials of his last names, Ossandón Stanke.

Now it's correct!

Better! But several questions arise:

  • Where did the bot acquire the knowledge from?
  • How did it determine that the pieces of knowledge were relevant to the question?
  • Did the model use its own knowledge to answer the question?

Inline citations

Inline citations are a way to add real, checkable sources to your text. This makes what you write more trustworthy and shows you've really looked into the topic. By telling readers where your info comes from, you make your work more believable [2].

They aren't new - you see them in science papers, books, and some websites. Imagine if all science papers had no references. It'd be tough to know if what they say is true.

Now, let's look at a simple way to add inline citations to LLM responses. This consists of several steps:

  • Add id and title to each piece of knowledge so we can reference them later.
  • Add a system message asking for citations and the format to use.
  • Add the pieces of knowledge in a nice format, using separators, so the LLM can easily differentiate them.

const piecesOfKnowledge = [
	{
		id: 1,
		source: "https://some.super.reliable.source.com/caco-os-explained",
		title: "Caco's favorite fruit",
		text: "Caco loves apples and oranges. But he loves oranges more than apples. And this has nothing to do with his domain.",
	},
	{
		id: 2,
		source: "https://en.wikipedia.org/wiki/CacoOS",
		title: "Caco's full name",
		text: "Caco's full name is Joaquín Ossandón Stanke.",
	},
	{
		id: 3,
		source: "https://why-caco-is-named-cacoos-everywhere.org",
		title: "Caco's nickname",
		text: "Caco's nickname, cacoos, is derived from the initials of his last names.",
	},
];

const systemMessage = `
Answer with all related pieces of knowledge. Always cite, between phrases, the ones you use. If you skip one, you will be penalized.

Use the format [citationId] between sentences. Use the exact same "citationId" present in the knowledge pieces.

Example:
The capital of Chile is Santiago de Chile[1], and the population is 7 million people[3].
`;

const knowledgeMessage = `
Pieces of knowledge:

${piecesOfKnowledge.map(({ id, title, text }) => `
	<<<
	citationId: "${id}"
	title: "${title}"

	${text}
	>>>
`).join('\n')}
`;

const messages = [
	{
		role: "system",
		text: systemMessage,
	},
	{
		role: "user",
		text: "Why is Caco's domain cacoos.com?",
	},
	{
		role: "bot",
		text: knowledgeMessage,
	},
];

runSomeBeautifulModel(messages);

You

Why is Caco's domain cacoos.com?

SomeAI

Caco's domain is cacoos.com because "cacoos" is his nickname, which is a combination of his first name, Caco, and the initials of his last names [3]. His full name is Joaquín Ossandón Stanke[2], so the nickname "cacoos" likely derives from "Caco" plus "OS" from Ossandón and "S" from Stanke, hence the domain name cacoos.com.

2. Caco's full name
3. Caco's nickname

Now we're talking!

The model referenced only the two pieces of knowledge relevant to the question, even within the phrases! This lends more credibility to the answer and instantly enhances its reliability.
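
The list of cited sources shown under the answer isn't produced by the model itself; it can be rendered by scanning the response for citation markers. A minimal sketch (the `citedPieces` helper is my own naming):

```javascript
// Collect the distinct citation ids that appear in the answer and map
// them back to their knowledge pieces, to render the source list.
function citedPieces(answer, pieces) {
	const ids = new Set(
		[...answer.matchAll(/\[(\d+)\]/g)].map((m) => Number(m[1])),
	);
	return pieces.filter((p) => ids.has(p.id));
}
```

Only the pieces the model actually cited end up in the list; unused ones, like the one about Caco's favorite fruit, are filtered out.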

Here are some improvements to enhance the experience even more:

  • Replace reference numbers with a previously generated identifier (id). This prevents the LLM from confusing these numbers with those in the knowledge pieces. Avoid using vowels, so the LLM doesn't try to read meaning into the id.
  • Using regular expressions, replace those references with nice-looking links.
  • Ensure citations are numbered according to their order of appearance.
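
As a sketch of the first improvement, such identifiers can be generated by sampling from a vowel-free alphabet (the length of 7 is an arbitrary choice that just matches the examples below):

```javascript
// Generate a short random id from a vowel-free alphabet, so the model
// treats it as an opaque token rather than a meaningful word or number.
const ALPHABET = "bcdfghjklmnpqrstvwxz0123456789BCDFGHJKLMNPQRSTVWXZ";

function generateCitationId(length = 7) {
	let id = "";
	for (let i = 0; i < length; i++) {
		id += ALPHABET[Math.floor(Math.random() * ALPHABET.length)];
	}
	return id;
}
```
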
const piecesOfKnowledge = [
	{
		id: "cDrm4Nt",
		source: "https://some.super.reliable.source.com/caco-os-explained",
		title: "Caco's favorite fruit",
		text: "Caco loves apples and oranges. But he loves oranges more than apples. And this has nothing to do with his domain.",
	},
	{
		id: "Kcrl9pr",
		source: "https://en.wikipedia.org/wiki/CacoOS",
		title: "Caco's full name",
		text: "Caco's full name is Joaquín Ossandón Stanke.",
	},
	{
		id: "wq1A2r3",
		source: "https://why-caco-is-named-cacoos-everywhere.org",
		title: "Caco's nickname",
		text: "Caco's nickname, cacoos, is derived from the initials of his last names.",
	},
];

const systemMessage = `...`

const messages = [...]

runSomeBeautifulModel(messages);

You

Why is Caco's domain cacoos.com?

SomeAI


💎
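
The last two improvements can be sketched together: scan the answer for `[id]` markers, number each citation by its order of appearance, and replace it with a link to its source. I'm assuming markdown links as the output format here; the double brackets around the number are just one way to render it:

```javascript
// Replace [citationId] markers with numbered markdown links,
// numbering citations by their order of appearance in the answer.
function linkifyCitations(answer, pieces) {
	const byId = new Map(pieces.map((p) => [p.id, p]));
	const numbers = new Map();
	return answer.replace(/\[([A-Za-z0-9]+)\]/g, (match, id) => {
		const piece = byId.get(id);
		if (!piece) return match; // leave unknown ids untouched
		if (!numbers.has(id)) numbers.set(id, numbers.size + 1);
		return `[[${numbers.get(id)}]](${piece.source})`;
	});
}
```

Repeated citations of the same piece keep the same number, and any bracketed text that doesn't match a known id is left as-is.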

Conclusion

Inline citations are really catching on in texts generated by LLMs with extra knowledge from outside sources. They make the answers more trustworthy, reliable, and right on point.

It's not just about the reader's experience, but also the model's performance. Requesting references makes the model more cautious in its statements. It's similar to conversing with people; asking them for references encourages them to adhere more closely to the truth and conduct thorough research.


At Perhaps, we are using inline citations. As I did in this post, we're not only linking to files or websites but also to pieces of knowledge that the model has acquired after being taught something new. This way, the model references its own knowledge!

Check it out at getperhaps.com.