2024-01-09

Boosting LLM Credibility with Inline Citations: A Guide

Large Language Models (LLMs) are remarkably intelligent, but as we all know, they are limited by their training data.

Let's say we want to answer the question "Why is Caco's domain cacoos.com?". Here’s what the interaction would look like:

const messages = [
	{
		role: "user",
		text: "Why is Caco's domain cacoos.com?",
	},
];

runSomeBeautifulModel(messages);

You

Why is Caco's domain cacoos.com?

SomeAI

Caco's domain is cacoos.com likely because it is a unique branding choice for their online presence, combining the name "Caco" with a playful twist on the word "OS" (operating system) or a phonetic play on "cocoa," depending on the context of the business or service.

Mmm

Nope! That's not the reason. But building my own OS could be interesting 🤔

To overcome this knowledge limitation, several ideas have been suggested for integrating external knowledge into Large Language Models (LLMs). The most well-known is Retrieval Augmented Generation (RAG), which involves fetching up-to-date or context-specific data from an external database and incorporating it into an LLM's response generation 1.
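The retrieval half of RAG can be sketched in a few lines. This toy `retrieve` function (an illustrative name, not a real library call) scores documents by naive word overlap with the question and keeps the top matches; a production system would use embeddings and a vector index instead:

```javascript
// Naive similarity search: score each document by how many words
// it shares with the question, then keep the top-k matches.
// Real RAG systems use embeddings instead of word overlap.
const tokenize = (text) => text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

function retrieve(question, documents, k = 3) {
	const questionWords = new Set(tokenize(question));
	return documents
		.map((doc) => ({
			doc,
			score: tokenize(doc.text).filter((w) => questionWords.has(w)).length,
		}))
		.sort((a, b) => b.score - a.score)
		.slice(0, k)
		.map(({ doc }) => doc);
}
```

The selected documents are then appended to the model's context, which is exactly what the next snippet simulates.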

Let's apply some external knowledge using a basic RAG approach. I'll add three pieces of knowledge to the model’s context window, assuming they’re the results of a similarity search.

const piecesOfKnowledge = [
	{
		source: "https://some.super.reliable.source.com/caco-os-explained",
		text: "Caco loves apples and oranges. But he loves oranges more than apples. And this has nothing to do with his domain.",
	},
	{
		source: "https://en.wikipedia.org/wiki/CacoOS",
		text: "Caco's full name is Joaquín Ossandón Stanke.",
	},
	{
		source: "https://why-caco-is-named-cacoos-everywhere.org",
		text: "Caco's nickname, cacoos, is derived from the initials of his last names.",
	},
];

const messages = [
	{
		role: "user",
		text: "Why is Caco's domain cacoos.com?",
	},
	{
		role: "bot",
		text: `Pieces of knowledge:\n\n${piecesOfKnowledge.map(({ text }) => text).join('\n')}`,
	},
];

runSomeBeautifulModel(messages);

You

Why is Caco's domain cacoos.com?

SomeAI

Caco's domain is cacoos.com because it's a play on his nickname "cacoos", derived from the initials of his last names, Ossandón Stanke.

Now it's correct!

Better, but several questions arise:

  • Where did the bot acquire the knowledge from?
  • How did it determine that the pieces of knowledge were relevant to the question?
  • Did the model use its own knowledge to answer the question?

Inline citations

Inline citations add real, checkable sources to your text. They make what you write more trustworthy and show you've actually researched the topic: by telling readers exactly where each claim comes from, you make your work verifiable 2.

They aren’t new—you see them in scientific papers, books, and on many websites. Imagine if scientific papers had no references; it’d be hard to trust any of them.

Now, let's look at a simple way to add inline citations to LLM responses. It takes several steps:

  • Add id and title to each piece of knowledge so we can reference them later.
  • Add a system message asking for citations and the format to use.
  • Add the pieces of knowledge in a structured format with clear separators so the LLM can easily distinguish them.
const piecesOfKnowledge = [
	{
		id: 1,
		source: "https://some.super.reliable.source.com/caco-os-explained",
		title: "Caco's favorite fruit",
		text: "Caco loves apples and oranges. But he loves oranges more than apples. And this has nothing to do with his domain.",
	},
	{
		id: 2,
		source: "https://en.wikipedia.org/wiki/CacoOS",
		title: "Caco's full name",
		text: "Caco's full name is Joaquín Ossandón Stanke.",
	},
	{
		id: 3,
		source: "https://why-caco-is-named-cacoos-everywhere.org",
		title: "Caco's nickname",
		text: "Caco's nickname, cacoos, is derived from the initials of his last names.",
	},
];

const systemMessage = `
Answer using all relevant pieces of knowledge. Add inline references next to the phrases where you use them.
If you omit a relevant reference, you will be penalized.

Use the format [citationId] within sentences. Use the exact same "citationId" from the knowledge pieces.

Example:
The capital of Chile is Santiago de Chile[1], and the population is 7 million people[3].
`

const knowledgeText = `
Pieces of knowledge:

${piecesOfKnowledge.map(({ id, title, text }) => `
	<<<
	citationId: "${id}"
	title: "${title}"

	${text}
	>>>
`).join('\n')}
`

const messages = [
	{
		role: "system",
		text: systemMessage,
	},
	{
		role: "user",
		text: "Why is Caco's domain cacoos.com?",
	},
	{
		role: "bot",
		text: knowledgeText,
	},
];

runSomeBeautifulModel(messages);

You

Why is Caco's domain cacoos.com?

SomeAI

Caco's domain is cacoos.com because "cacoos" is his nickname, which is a combination of his first name, Caco, and the initials of his last names [3]. His full name is Joaquín Ossandón Stanke[2], so the nickname "cacoos" likely derives from "Caco" plus "OS" from Ossandón and "S" from Stanke, hence the domain name cacoos.com.

2. Caco's full name
3. Caco's nickname

Now we're talking!

The model referenced only the two pieces of knowledge relevant to the question, and it placed each citation inline, right where the fact is used! This lends the answer more credibility and makes it instantly verifiable.

Here are a few improvements to polish the experience further:

  • Replace numeric references with a pre-generated identifier (id). Avoid vowels so the LLM doesn’t try to interpret the id as a word.
  • Use regular expressions to replace those references with nice-looking links.
  • Ensure citations are numbered according to their order of appearance.
const piecesOfKnowledge = [
	{
		id: "cDrm4Nt",
		source: "https://some.super.reliable.source.com/caco-os-explained",
		title: "Caco's favorite fruit",
		text: "Caco loves apples and oranges. But he loves oranges more than apples. And this has nothing to do with his domain.",
	},
	{
		id: "Kcrl9pr",
		source: "https://en.wikipedia.org/wiki/CacoOS",
		title: "Caco's full name",
		text: "Caco's full name is Joaquín Ossandón Stanke.",
	},
	{
		id: "wq1A2r3",
		source: "https://why-caco-is-named-cacoos-everywhere.org",
		title: "Caco's nickname",
		text: "Caco's nickname, cacoos, is derived from the initials of his last names.",
	},
];

const systemMessage = `...`

const messages = [...]

runSomeBeautifulModel(messages);

You

Why is Caco's domain cacoos.com?

SomeAI

💎
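The last two improvements can be sketched as a post-processing step on the model's answer. This hypothetical `renderCitations` helper (its name and return shape are illustrative, but it assumes the `{ id, source, title }` fields from the `piecesOfKnowledge` objects above) uses a regular expression to renumber citations by order of appearance and rewrite them as links:

```javascript
// Post-process the model's answer: renumber [id] citations by order
// of appearance and turn each one into a link to its source.
function renderCitations(answer, piecesOfKnowledge) {
	const byId = new Map(piecesOfKnowledge.map((p) => [p.id, p]));
	const numbers = new Map(); // citation id -> display number

	const html = answer.replace(/\[([^\]]+)\]/g, (match, id) => {
		const piece = byId.get(id);
		if (!piece) return match; // leave unknown ids untouched
		if (!numbers.has(id)) numbers.set(id, numbers.size + 1);
		return `<a href="${piece.source}" title="${piece.title}">[${numbers.get(id)}]</a>`;
	});

	// Sources listed in the order they were first cited.
	const sources = [...numbers.keys()].map(
		(id) => `${numbers.get(id)}. ${byId.get(id).title}`,
	);
	return { html, sources };
}
```

Because a `Map` iterates in insertion order, the numbering always follows the order in which citations first appear in the answer, regardless of how the knowledge pieces were listed in the prompt.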

Conclusion

Inline citations are increasingly common in LLM-generated text augmented with external knowledge. They make answers more trustworthy, reliable, and precise.

It’s not just about the reader’s experience; it also improves the model’s performance. Requesting references makes the model more careful in its claims—much like asking people for sources encourages them to stay closer to the truth and to research more thoroughly.