Retrieval-Augmented Generation (RAG)

Why it matters

A model only knows what was in its training data, frozen at a cutoff date, blended into weights you cannot inspect. Ask it about your company's refund policy, last week's incident postmortem, or a private API and it will either refuse or — worse — confidently invent something plausible. Retrieval-Augmented Generation (RAG) fixes this by changing the question from "what does the model remember?" to "what does the model say given this evidence I just handed it?" You retrieve relevant text at query time and paste it into the prompt, so the model reasons over fresh, authoritative, inspectable context.

This is the single most reliable lever for reducing hallucination on knowledge-intensive tasks. It is also why RAG is technique #15 and not technique #1: it sits at the boundary between prompt engineering and system design. The retrieval pipeline is engineering; how you frame the retrieved text in the prompt is prompt engineering — and that framing is where most teams leave accuracy on the table.

How it works

The minimal loop has three steps. Index: split your corpus into chunks, embed each chunk into a vector, store the vectors. Retrieve: embed the user's query, fetch the top-k nearest chunks. Generate: assemble those chunks into the prompt and instruct the model to answer using only them. The prompt template is the part you own:

You are a support assistant. Answer the question using ONLY the
context below. Each source is tagged with an ID.

If the context does not contain the answer, reply exactly:
"I don't have that information." Do not use prior knowledge.

Cite the source ID in brackets after each claim, e.g. [doc-3].

<context>
[doc-1] Refunds are processed within 5 business days of approval.
[doc-2] Refund requests must be filed within 30 days of purchase.
[doc-7] Digital goods are non-refundable once downloaded.
</context>

Question: Can I get a refund on an ebook I downloaded last week?

A grounded model answers: "No — digital goods are non-refundable once downloaded [doc-7]." That answer is checkable. A reviewer can open doc-7 and confirm it. That auditability — not just the accuracy — is the real prize.
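
The index and retrieve steps can be sketched in a few lines. This is an illustration under stated assumptions, not a production retriever: `embed` is a toy bag-of-words stand-in for a real embedding model, and the names `build_index` and `retrieve` are hypothetical rather than taken from any library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: dict[str, str]) -> dict[str, Counter]:
    """Index step: embed every chunk once and keep the vectors by ID."""
    return {doc_id: embed(text) for doc_id, text in chunks.items()}


def retrieve(query: str, index: dict[str, Counter], k: int = 2) -> list[str]:
    """Retrieve step: embed the query and return the IDs of the k nearest chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda doc_id: cosine(q, index[doc_id]), reverse=True)
    return ranked[:k]


chunks = {
    "doc-1": "Refunds are processed within 5 business days of approval.",
    "doc-2": "Refund requests must be filed within 30 days of purchase.",
    "doc-7": "Digital goods are non-refundable once downloaded.",
}
index = build_index(chunks)
top_ids = retrieve("Are digital goods refundable once downloaded?", index, k=1)
```

With a real embedding model, only `embed` changes; the ranking and top-k logic keep the same shape.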

Chunking and relevance

Retrieval quality dominates output quality: if the right chunk never reaches the prompt, no amount of clever instruction recovers it. Two knobs matter most.

Chunk size. Too large and a chunk mixes several topics, diluting its embedding and burning context budget. Too small and it loses the surrounding context needed to be meaningful. Splitting on semantic boundaries (paragraphs, sections, list items) usually beats splitting on a fixed character count. For prose, a few hundred tokens with modest overlap between adjacent chunks is a sane starting point that you then tune empirically.

Relevance and k. More chunks are not better. Padding the prompt with marginally relevant passages introduces distractors the model may latch onto, and pushes the genuinely useful chunk toward the middle, where models attend to it less reliably. Retrieve a handful of strong chunks rather than twenty weak ones, and consider a reranking step that re-scores candidates against the query before they enter the prompt.
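
One way to split on paragraph boundaries with overlap is sketched below. It is a minimal illustration, assuming paragraphs are separated by blank lines and using whitespace word counts as a crude proxy for tokens; `chunk_paragraphs` and its parameters are hypothetical names, and the size limit is soft (a carried-over paragraph or a single long paragraph can push a chunk past it).

```python
def chunk_paragraphs(text: str, max_words: int = 200, overlap_paragraphs: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_words words.

    Adjacent chunks share the last `overlap_paragraphs` paragraphs, so a
    sentence that depends on the preceding paragraph keeps its context.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        size = sum(len(p.split()) for p in current) + len(para.split())
        if current and size > max_words:
            # Close the current chunk and seed the next one with the overlap.
            chunks.append("\n\n".join(current))
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "a b c\n\nd e f\n\ng h i"
overlapping = chunk_paragraphs(sample, max_words=6, overlap_paragraphs=1)
disjoint = chunk_paragraphs(sample, max_words=6, overlap_paragraphs=0)
```

Here `overlapping` repeats the middle paragraph in both chunks, while `disjoint` does not; which setting retrieves better is something to measure on your own corpus.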

Grounding the generation

Retrieval gets the right text into the prompt; grounding instructions get the model to actually use it. Three instructions carry most of the weight:

Restrict the source. "Answer using only the context below" measurably shifts the model away from its parametric memory.

License abstention. Explicitly permit "I don't have that information." Without it, a model asked an unanswerable question will fill the gap with invention. Empirical prompt-engineering work consistently finds that giving the model an out reduces fabrication.

Demand citations. Requiring an inline source ID per claim does double duty: it makes answers verifiable, and the act of citing nudges the model to stay anchored to the supplied text.
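
The three instructions can be baked into a small template function. The sketch below mirrors the wording of the support-assistant prompt shown earlier; `build_grounded_prompt` is a hypothetical helper name, and the chunk dictionary is assumed to map source IDs to retrieved text.

```python
def build_grounded_prompt(question: str, chunks: dict[str, str]) -> str:
    """Assemble a prompt that restricts the source, licenses abstention,
    and demands citations, with each chunk tagged by its ID."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in chunks.items())
    return "\n".join([
        # Restrict the source.
        "Answer the question using ONLY the context below. "
        "Each source is tagged with an ID.",
        "",
        # License abstention.
        "If the context does not contain the answer, reply exactly:",
        '"I don\'t have that information." Do not use prior knowledge.',
        "",
        # Demand citations.
        "Cite the source ID in brackets after each claim, e.g. [doc-3].",
        "",
        "<context>",
        context,
        "</context>",
        "",
        f"Question: {question}",
    ])


prompt = build_grounded_prompt(
    "Can I get a refund on an ebook I downloaded last week?",
    {"doc-7": "Digital goods are non-refundable once downloaded."},
)
```

Keeping the template in one function makes it easy to test wording changes against a fixed set of questions.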

Pitfalls

Garbage in, confident garbage out. RAG does not verify your corpus. If the indexed documents are stale or contradictory, the model will faithfully ground its answer in the wrong fact. Curation is part of the system.

Citations can be fabricated too. A model asked to cite will sometimes invent a plausible-looking ID or attribute a claim to a chunk that does not support it. Citations improve verifiability; they are not proof. For high-stakes use, programmatically check that each cited ID exists and ideally that the claim is entailed by it.
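
The existence check is cheap to automate. The sketch below assumes citations appear as bracketed IDs of the form letters-hyphen-digits (such as [doc-7]); the function names are hypothetical. It only confirms that each cited ID was actually supplied, and does not test whether the chunk entails the claim, which needs a separate step.

```python
import re

# Assumed citation format: [letters-digits], e.g. [doc-7] or [a-12].
CITATION = re.compile(r"\[([A-Za-z]+-\d+)\]")


def cited_ids(answer: str) -> list[str]:
    """Return every source ID cited in the answer, in order of appearance."""
    return CITATION.findall(answer)


def unknown_citations(answer: str, supplied_ids: set[str]) -> list[str]:
    """Return cited IDs that were never in the prompt (likely fabricated)."""
    return sorted(set(cited_ids(answer)) - supplied_ids)


supplied = {"doc-1", "doc-2", "doc-7"}
ok = unknown_citations("No, digital goods are non-refundable once downloaded [doc-7].", supplied)
bad = unknown_citations("Refunds take 2 days [doc-9].", supplied)
```

An answer with no citations at all also passes `unknown_citations`, so a high-stakes pipeline would probably also reject answers where `cited_ids` comes back empty.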

The model overrides the context. When retrieved evidence contradicts what the model "believes," it sometimes sides with its training. Strong grounding language helps but does not fully eliminate this; test it deliberately with cases where the correct answer is one your model would otherwise get wrong.

Retrieval is the bottleneck, not the prompt. Teams often tune the generation prompt for days while the retriever silently fails to surface the relevant chunk. Evaluate the two stages separately: measure retrieval (did the right chunk make it into context?) before you measure generation. Many "the model hallucinated" bugs are actually "the evidence was never retrieved" bugs.
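
A simple retrieval-only metric is the fraction of test questions whose known-correct chunk appears in the top k results. The sketch below assumes you have a small labeled set with one gold chunk ID per question; `hit_rate_at_k` is a hypothetical helper, and the IDs are made up for illustration.

```python
def hit_rate_at_k(retrieved: list[list[str]], gold: list[str], k: int) -> float:
    """Fraction of queries whose gold chunk ID is among the top-k retrieved IDs.

    retrieved[i] is the ranked ID list for query i; gold[i] is its correct chunk.
    """
    if not gold:
        return 0.0
    hits = sum(1 for ids, target in zip(retrieved, gold) if target in ids[:k])
    return hits / len(gold)


# Illustrative data: ranked retriever output for three test questions.
retrieved = [["doc-7", "doc-1"], ["doc-2", "doc-1"], ["doc-1", "doc-3"]]
gold = ["doc-7", "doc-1", "doc-2"]
rate_at_1 = hit_rate_at_k(retrieved, gold, k=1)
rate_at_2 = hit_rate_at_k(retrieved, gold, k=2)
```

If this number is low, no change to the generation prompt will help, because the evidence never reaches the model.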

Lost in the middle. When you do pass several chunks, order them so the strongest candidates sit at the start or end of the context block rather than buried in the center, and keep the total tight.
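
One possible ordering scheme, assuming the chunks arrive ranked best-first, alternates them between the front and the back of the context block so the weakest candidates land in the center. The function name `order_for_edges` is hypothetical, and this is only one of several reasonable layouts.

```python
def order_for_edges(ranked: list[str]) -> list[str]:
    """Reorder best-first chunks so the strongest sit at the start and end.

    Even-ranked positions (0, 2, 4, ...) fill from the front; odd-ranked
    positions (1, 3, ...) fill from the back, leaving the weakest in the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for position, chunk in enumerate(ranked):
        (front if position % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ordered = order_for_edges(["best", "second", "third", "fourth", "weakest"])
# ordered == ["best", "third", "weakest", "fourth", "second"]
```

Whether this helps depends on the model and the context length, so it is worth checking against the same test questions used for the rest of the pipeline.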

RAG turns "trust the model" into "trust the model plus the evidence it shows its work against." That shift — from unfalsifiable to checkable — is the whole point.

Support assistant over a policy corpus

Why it's better: The bad prompt dumps the entire corpus (no retrieval), so the relevant fact is buried among distractors and the model has no instruction to ground, abstain, or cite. The good prompt passes a few retrieved, ID-tagged chunks and gives explicit grounding, abstention, and citation rules — producing a verifiable answer: "Within 5 business days of approval [a-12]."

Engineering Q&A over internal docs

Why it's better: The bad prompt invites the model to answer a private-infrastructure question from training data it cannot possibly have, guaranteeing a confident hallucination. The good prompt grounds the answer in retrieved internal chunks, forbids outside knowledge, and requires citations, so the response is both correct and auditable against the actual source docs.