Retrieval-Augmented Prompting (RAG)
Ground the model in retrieved evidence so it answers from your data instead of its training-time memory.
Why it matters
A model only knows what was in its training data, frozen at a cutoff date, blended into weights you cannot inspect. Ask it about your company's refund policy, last week's incident postmortem, or a private API and it will either refuse or — worse — confidently invent something plausible. Retrieval-Augmented Generation (RAG) fixes this by changing the question from "what does the model remember?" to "what does the model say given this evidence I just handed it?" You retrieve relevant text at query time and paste it into the prompt, so the model reasons over fresh, authoritative, inspectable context.
This is the single most reliable lever for reducing hallucination on knowledge-intensive tasks. RAG is nonetheless technique #15 and not technique #1, because it sits at the boundary between prompt engineering and system design. The retrieval pipeline is engineering; how you frame the retrieved text in the prompt is prompt engineering, and that framing is where many teams leave accuracy on the table.
How it works
The minimal loop has three steps. Index: split your corpus into chunks, embed each chunk into a vector, store the vectors. Retrieve: embed the user's query, fetch the top-k nearest chunks. Generate: assemble those chunks into the prompt and instruct the model to answer using only them. The prompt template is the part you own:
You are a support assistant. Answer the question using ONLY the
context below. Each source is tagged with an ID.
If the context does not contain the answer, reply exactly:
"I don't have that information." Do not use prior knowledge.
Cite the source ID in brackets after each claim, e.g. [doc-3].
<context>
[doc-1] Refunds are processed within 5 business days of approval.
[doc-2] Refund requests must be filed within 30 days of purchase.
[doc-7] Digital goods are non-refundable once downloaded.
</context>
Question: Can I get a refund on an ebook I downloaded last week?
A grounded model answers: "No — digital goods are non-refundable once downloaded [doc-7]." That answer is checkable. A reviewer can open doc-7 and confirm it. That auditability — not just the accuracy — is the real prize.
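The index and retrieve steps of the loop can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not a production retriever: `embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, the "vector store" is a plain dictionary, and the corpus is the three example sources from the template above.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(corpus: dict[str, str]) -> dict[str, Counter]:
    """Index step: embed every chunk once and keep the vectors by ID."""
    return {doc_id: embed(text) for doc_id, text in corpus.items()}

def retrieve(query: str, index: dict[str, Counter], k: int = 2) -> list[str]:
    """Retrieve step: embed the query and return the k most similar chunk IDs."""
    q = embed(query)
    ranked = sorted(index, key=lambda doc_id: cosine(q, index[doc_id]), reverse=True)
    return ranked[:k]

corpus = {
    "doc-1": "Refunds are processed within 5 business days of approval.",
    "doc-2": "Refund requests must be filed within 30 days of purchase.",
    "doc-7": "Digital goods are non-refundable once downloaded.",
}
index = build_index(corpus)
top_ids = retrieve("Can I get a refund on an ebook I downloaded last week?", index, k=2)
print(top_ids)
```

On this toy corpus the sketch ranks doc-7 first, but only because the query and the chunk share the word "downloaded". A learned embedding model would also be expected to connect "ebook" with "digital goods", which word counting cannot do.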
Chunking and relevance
Retrieval quality dominates output quality: if the right chunk never reaches the prompt, no amount of clever instruction recovers it. Two knobs matter most.
- Chunk size. Too large and a chunk mixes several topics, diluting its embedding and burning context budget. Too small and it loses the surrounding context needed to be meaningful. Splitting on semantic boundaries — paragraphs, sections, list items — beats splitting on a fixed character count. For prose, a few hundred tokens with modest overlap between adjacent chunks is a sane starting point you then tune empirically.
- Relevance and k. More chunks is not better. Padding the prompt with marginally relevant passages introduces distractors the model may latch onto, and pushes the genuinely useful chunk toward the middle, where models attend to it less reliably. Retrieve a handful of strong chunks rather than twenty weak ones, and consider a reranking step that re-scores candidates against the query before they enter the prompt.
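As one possible reading of the chunk-size advice, the sketch below packs whole paragraphs into chunks and repeats the last paragraph of each chunk at the start of the next. The function name `chunk_paragraphs`, the blank-line paragraph delimiter, and the use of word counts as a rough proxy for tokens are all assumptions made for illustration; the defaults are starting points to tune, not recommendations.

```python
def chunk_paragraphs(text: str, max_words: int = 200, overlap: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_words words.

    Paragraphs are never split, so the budget is soft: a single oversized
    paragraph (or carried-over overlap) can push a chunk past max_words.
    The last `overlap` paragraphs of a chunk are repeated in the next one.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap:] if overlap else []
            count = sum(len(p.split()) for p in current)
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "A a a\n\nB b b\n\nC c c"
print(chunk_paragraphs(sample, max_words=6, overlap=1))
```

Splitting on paragraph boundaries keeps each chunk topically coherent, and the overlap gives a chunk some of the context that preceded it.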
Grounding the generation
Retrieval gets the right text into the prompt; grounding instructions get the model to actually use it. Three instructions carry most of the weight:
- Restrict the source. "Answer using only the context above" measurably shifts the model away from its parametric memory.
- License abstention. Explicitly permit "I don't have that information." Without it, a model asked an unanswerable question will fill the gap with invention. The empirical prompt-engineering literature is consistent that giving the model an out reduces fabrication.
- Demand citations. Requiring an inline source ID per claim does double duty: it makes answers verifiable, and the act of citing nudges the model to stay anchored to the supplied text.
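The three instructions can be assembled mechanically. The sketch below is one way to do it, assuming chunks arrive as `(id, text)` pairs; `build_prompt` and `is_abstention` are hypothetical helper names, and the wording follows the support-assistant template shown earlier.

```python
ABSTAIN = "I don't have that information."

def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble a grounded prompt from (chunk_id, text) pairs."""
    context = "\n".join(f"[{chunk_id}] {text}" for chunk_id, text in chunks)
    return (
        # 1. Restrict the source.
        "Answer the question using ONLY the context below. "
        "Do not use prior knowledge.\n"
        # 2. License abstention with an exact, detectable reply.
        f'If the context does not contain the answer, reply exactly: "{ABSTAIN}"\n'
        # 3. Demand citations.
        "Cite the source ID in brackets after each claim, e.g. [doc-3].\n"
        f"<context>\n{context}\n</context>\n"
        f"Question: {question}"
    )

def is_abstention(reply: str) -> bool:
    """True if the model used the licensed abstention string (quotes tolerated)."""
    return reply.strip().strip('"') == ABSTAIN

prompt = build_prompt(
    "Can I get a refund on an ebook I downloaded last week?",
    [("doc-7", "Digital goods are non-refundable once downloaded.")],
)
print(prompt)
```

Fixing the abstention string to an exact phrase is a design choice: it lets downstream code detect "no answer" with a string comparison instead of guessing from free text.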
Pitfalls
Garbage in, confident garbage out. RAG does not verify your corpus. If the indexed documents are stale or contradictory, the model will faithfully ground its answer in the wrong fact. Curation is part of the system.
Citations can be fabricated too. A model asked to cite will sometimes invent a plausible-looking ID or attribute a claim to a chunk that does not support it. Citations improve verifiability; they are not proof. For high-stakes use, programmatically check that each cited ID exists and ideally that the claim is entailed by it.
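The existence check is simple to automate. The sketch below assumes citations use the bracketed-ID format from the earlier template; `check_citations` is a hypothetical helper, and it deliberately covers only the first half of the advice (the ID exists), not entailment, which would need a separate model or human review.

```python
import re

# Matches bracketed IDs such as [doc-7] or [a-12].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")

def check_citations(answer: str, supplied_ids: set[str]) -> dict:
    """Report which cited IDs were never supplied in the prompt."""
    cited = CITATION.findall(answer)
    unknown = sorted(set(cited) - supplied_ids)
    return {
        "cited": cited,
        "unknown": unknown,
        # An answer with no citations at all also fails the check.
        "ok": bool(cited) and not unknown,
    }

supplied = {"doc-1", "doc-2", "doc-7"}
print(check_citations("No, digital goods are non-refundable once downloaded [doc-7].", supplied))
print(check_citations("Refunds take two days [doc-9].", supplied))
```

An abstention reply carries no citations, so a caller would normally test for the abstention string before running this check.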
The model overrides the context. When retrieved evidence contradicts what the model "believes," it sometimes sides with its training. Strong grounding language helps but does not fully eliminate this; test it deliberately with cases where the correct answer is one your model would otherwise get wrong.
Retrieval is the bottleneck, not the prompt. Teams often tune the generation prompt for days while the retriever silently fails to surface the relevant chunk. Evaluate the two stages separately: measure retrieval (did the right chunk make it into context?) before you measure generation. Many "the model hallucinated" bugs turn out to be "the evidence was never retrieved" bugs.
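Measuring retrieval on its own needs only a small labeled set of queries and the chunk ID each one should surface. The sketch below computes a hit rate at k; `hit_rate_at_k`, the evaluation pairs, and the stubbed retriever are all hypothetical and stand in for whatever retriever and labels a team actually has.

```python
from typing import Callable

def hit_rate_at_k(
    eval_set: list[tuple[str, str]],
    retrieve_fn: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose expected chunk ID appears in the top k results.

    eval_set holds (query, expected_chunk_id) pairs; retrieve_fn returns
    chunk IDs ranked best first.
    """
    if not eval_set:
        return 0.0
    hits = sum(1 for query, gold_id in eval_set if gold_id in retrieve_fn(query)[:k])
    return hits / len(eval_set)

# Stub retriever with canned rankings, for illustration only.
canned = {
    "how long do refunds take": ["a-12", "a-31"],
    "can I return a gift card": ["a-31", "a-12"],
}
eval_set = [
    ("how long do refunds take", "a-12"),
    ("can I return a gift card", "a-44"),  # expected chunk is never retrieved
]
print(hit_rate_at_k(eval_set, lambda q: canned[q], k=2))
```

A low hit rate here points at chunking, embedding, or k, and no change to the generation prompt will repair it.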
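Lost in the middle. Models tend to use material at the start and end of a long context more reliably than material in the center. When you do pass several chunks, order them so the strongest candidates sit at the start or end of the context block rather than buried in the center, and keep the total tight.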
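One simple way to get that ordering is to alternate ranked chunks between the front and the back of the context block. The sketch below assumes the input is already sorted best first; `order_for_edges` is a hypothetical helper, and this interleaving is one heuristic among several, not an established standard.

```python
def order_for_edges(ranked_chunks: list) -> list:
    """Reorder best-first chunks so the strongest sit at both ends.

    Even-ranked items fill the front, odd-ranked items fill the back in
    reverse, which leaves the weakest chunks in the center.
    """
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# "a" is the strongest chunk, "e" the weakest.
print(order_for_edges(["a", "b", "c", "d", "e"]))
```

With five chunks ranked a through e, the best chunk lands first, the second best lands last, and the weakest ends up in the middle.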
RAG turns "trust the model" into "trust the model plus the evidence it shows its work against." That shift — from unfalsifiable to checkable — is the whole point.
Support assistant over a policy corpus
✕ Weaker
Here are some help articles:
[pasted: 20 full articles, ~15k tokens]
The user asks: how long do refunds take? Answer helpfully.
✓ Stronger
Answer the question using ONLY the sources below. Cite the source ID
in brackets after each claim. If the sources don't contain the answer,
reply exactly: "I don't have that information."
<sources>
[a-12] Approved refunds are credited within 5 business days.
[a-31] Refund eligibility ends 30 days after purchase.
</sources>
Question: How long do refunds take once approved?
Why it's better: The weaker prompt dumps the entire corpus with no retrieval step, so the relevant fact is buried among distractors and the model has no instruction to ground, abstain, or cite. The stronger prompt passes a few retrieved, ID-tagged chunks and gives explicit grounding, abstention, and citation rules, producing a verifiable answer: "Within 5 business days of approval [a-12]."
Engineering Q&A over internal docs
✕ Weaker
Using your knowledge, what's the rate limit on our internal Payments API and how do I request an increase?
✓ Stronger
You are answering questions about internal services. Use ONLY the
retrieved context. Do not use outside knowledge. After each statement,
cite the chunk ID. If the answer isn't in the context, say you don't
have it and suggest which doc to check.
<context>
[pay-svc-04] The Payments API allows 100 req/s per service account by default.
[pay-svc-09] Rate-limit increases are requested via the #payments-platform channel with a justification.
</context>
Question: What's the default Payments API rate limit and how do I get it raised?
Why it's better: The weaker prompt invites the model to answer a private-infrastructure question from training data it cannot possibly have, so the best case is a refusal and the worst is a confident hallucination. The stronger prompt grounds the answer in retrieved internal chunks, forbids outside knowledge, and requires citations, so the response can be audited against the actual source docs.
Key takeaways
- RAG changes the question from 'what does the model remember' to 'what does the model conclude from this evidence' — that's why it's the most reliable anti-hallucination lever for knowledge tasks.
- Retrieval quality caps output quality: if the right chunk never reaches the prompt, no instruction can save the answer. Evaluate retrieval separately from generation.
- Three grounding instructions do most of the work: restrict the source, explicitly license 'I don't know', and demand inline citations.
- More chunks is not better — distractors and 'lost in the middle' effects degrade accuracy. Prefer a few strong, reranked chunks over many weak ones.
- Citations improve verifiability but can themselves be fabricated; for high-stakes use, programmatically check that cited IDs exist and support the claim.
Further reading
- Schulhoff et al., "The Prompt Report" (Learn Prompting)
- Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts"
- Sander Schulhoff interview, Lenny's Podcast