Context engineering: why RAG breaks in production
Why knowledge-retrieval systems look perfect in a demo and degrade within a month, and which engineering decisions a working RAG is actually made of.
RAG almost always looks good in a demo. You take a dozen documents, ask questions those documents directly answer, and get accurate quotes. The decision to go ahead is made on that demo. A month later the same system in production answers confidently and wrongly, and nobody can explain why. The problem is not the model but the context, and context is almost never engineered as a system.
“A vector DB with a prompt” is a prototype
The common mental model: put documents in a vector DB, retrieve similar chunks on a query, paste them into the prompt. It works in the demo precisely because demo data is small, fresh and pre-selected to match the questions. In production every one of those assumptions breaks.
The index goes stale. Documents change, but the index is rebuilt weekly on a schedule or by hand. In between, the system keeps answering from the old policy with full confidence. The user cannot see that the answer is stale; they see only a confident answer.
Retrieval is noisy. Pure vector similarity is good at “about this in general” but poor at telling “exactly this case” from “similar but different”. Without a lexical component and re-ranking, similar-but-wrong fragments land in context — and the model dutifully answers from them.
Context is unbounded. Everything retrieved goes into the prompt “so as not to lose quality”. This hurts cost and quality at once: the model extracts the essential information less reliably from a long, noisy context than from a short, precise one.
What a working RAG is made of
Event-driven re-indexing. The index rebuilds on a source-change event, not on a schedule: when a source changes, only the affected part of the index is updated. This removes an entire class of “confidently stale” answers that is otherwise impossible to debug because it cannot be reproduced.
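A minimal sketch of that idea, under stated assumptions: the names `ChunkIndex` and `on_source_changed` are hypothetical, the index is a plain in-memory dictionary, and splitting on blank lines stands in for whatever chunking and embedding a real pipeline uses. The point it illustrates is that one event touches one document, and that a content hash makes replayed events harmless.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChunkIndex:
    """Toy in-memory index: doc_id -> (content hash, list of chunks)."""
    entries: dict = field(default_factory=dict)

    def on_source_changed(self, doc_id: str, text: Optional[str]) -> str:
        """Handle one source-change event and report what was done."""
        if text is None:  # the source document was deleted
            return "deleted" if self.entries.pop(doc_id, None) else "noop"
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        current = self.entries.get(doc_id)
        if current and current[0] == digest:
            return "noop"  # duplicate or replayed event, content unchanged
        # Placeholder chunking: one chunk per blank-line-separated paragraph.
        chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
        self.entries[doc_id] = (digest, chunks)  # only this document changes
        return "reindexed"
```

In use, whatever emits change events (a webhook, a queue consumer, a file watcher) would call `on_source_changed` once per event; the rest of the index is left alone.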
Hybrid retrieval and re-ranking. Dense (vector) search finds candidates by meaning, lexical search filters out fragments that look right but are not, and re-ranking orders what remains by relevance to the specific query. What goes into context is not the top-N by cosine similarity but the smallest re-ranked set that covers the query.
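One way to combine the two candidate lists is reciprocal rank fusion (RRF). The article does not name a fusion method, so RRF is a swapped-in choice here, and `k=60` is only the conventional default. The function names are hypothetical, and `rerank_score` is an assumed callable standing in for a real re-ranker such as a cross-encoder. A rough sketch:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so chunks that both retrievers agree on rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for a stable order.
    return sorted(scores, key=lambda c: (-scores[c], c))


def select_context(dense_hits, lexical_hits, rerank_score, keep=3):
    """Fuse candidates, re-rank them for the query, keep a small set."""
    candidates = reciprocal_rank_fusion([dense_hits, lexical_hits])
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return reranked[:keep]
```

Fusion only builds the candidate pool; the final order comes from the re-ranker, and `keep` is deliberately small so that the prompt receives a minimum rather than everything that was found.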
Context budget. A token cap per step. This forces the system to decide which of the retrieved fragments are actually needed, instead of offloading that decision to the model by feeding it everything.
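A token cap can be enforced with a greedy pass over chunks that are already ordered best-first. This is a sketch, not a prescribed method: `pack_context` is a hypothetical name, and the default whitespace word count is only a rough stand-in for the tokenizer of whichever model is in use.

```python
def pack_context(ranked_chunks, max_tokens, count_tokens=None):
    """Greedily keep the best-ranked chunks that fit within max_tokens."""
    if count_tokens is None:
        # Assumption: whitespace word count as a rough proxy for tokens.
        count_tokens = lambda text: len(text.split())
    selected, used = [], 0
    for chunk in ranked_chunks:  # expected to be ordered best-first
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # does not fit; a later, smaller chunk still might
        selected.append(chunk)
        used += cost
    return selected, used
```

Returning the number of tokens used alongside the selection makes the budget visible in logs, which helps when a cost or quality regression has to be traced to a specific step.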
Quality evaluation. Without it RAG degrades invisibly. The minimum is relevance and groundedness metrics (how much the answer relies on the supplied context vs. the model’s memory), regression sets of questions with known answers, and tracing of which fragment influenced each answer. If you can’t answer “why did the system answer that yesterday”, you cannot debug it.
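The sketch below shows the shape of such a check, under heavy assumptions: groundedness is approximated by plain token overlap between answer and context (real setups often use an entailment model or an LLM judge), the 0.6 threshold is arbitrary, and `answer_fn` is a hypothetical callable that returns the answer together with the chunks that were placed in the prompt.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def groundedness(answer, context_chunks):
    """Share of answer tokens that also occur in the supplied context."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_vocab = set(_tokens(" ".join(context_chunks)))
    hits = sum(1 for t in answer_tokens if t in context_vocab)
    return hits / len(answer_tokens)


def run_regression(cases, answer_fn, min_groundedness=0.6):
    """Run each case; return a trace of the failures for debugging."""
    failures = []
    for case in cases:
        answer, chunks = answer_fn(case["question"])
        score = groundedness(answer, chunks)
        correct = case["expected"].lower() in answer.lower()
        if not correct or score < min_groundedness:
            # Keep the chunks: they are the trace of what shaped the answer.
            failures.append({"question": case["question"], "answer": answer,
                             "chunks": chunks, "groundedness": score})
    return failures
```

Storing the retrieved chunks next to each failing answer is what makes “why did the system answer that yesterday” answerable after the fact.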
Where the line usually is
A few field observations:
- If demo quality is excellent but production “sometimes lies”, the culprit is almost always not the model or the retrieval algorithm but index freshness and missing re-ranking. Check those first; don’t swap the model.
- “Add more context” hurts more often than it helps. Answer accuracy depends more on how relevant the context is than on how complete it is.
- Groundedness beats fluency. An answer that honestly says “the sources don’t contain this” is more useful than a confident guess — but that behavior must be deliberately designed and tested; it does not appear on its own.
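The abstention behavior from the last observation can be sketched as a relevance gate in front of the generator. Everything here is illustrative: `answer_or_abstain`, the `generate` callable and the 0.5 threshold are assumptions, and the scores are expected to come from whatever re-ranker the pipeline uses.

```python
NO_ANSWER = "The sources don't contain this."


def answer_or_abstain(question, scored_chunks, generate, min_score=0.5):
    """Call the generator only when some chunk clears the relevance bar."""
    relevant = [chunk for chunk, score in scored_chunks if score >= min_score]
    if not relevant:
        return NO_ANSWER  # abstain instead of guessing from weak context
    return generate(question, relevant)
```

A gate like this is testable in isolation: a regression set can include questions the sources do not cover and assert that the fixed refusal comes back.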
Conclusion
Context engineering is a distinct discipline: source normalization and versioning, event-driven indexing, hybrid retrieval with re-ranking, context budgets and continuous quality evaluation. Answer quality is determined by which context, and how much of it, reached the model — not by the size of the model. Teams that don’t build this don’t get “bad RAG” — they get RAG that degrades invisibly and can’t be trusted exactly when it matters most.