engineering notes

How to build a RAG system that doesn't lie in production

A practical breakdown of building a RAG system: sources, event-based indexing, hybrid search with reranking, and grounding evaluation.

In brief for executives. RAG is a way to make the model answer from your data, not from what it “knows”. A demo on a dozen documents almost always looks perfect and tells you almost nothing about production. In real operation a RAG system rarely says “I don’t know”; it confidently answers incorrectly, and for the business that is more expensive than an honest “no answer”. The difference between a demo and a working system is not the model but context engineering: how sources are normalized, how they are updated, how the minimally sufficient fragment is retrieved, and how the answer is checked for grounding. That is what to require from a pilot and what its cost should be measured against.

The scenario repeats project after project. A knowledge base is taken, documents are loaded into a vector store, a model is connected — in the demo the assistant confidently answers questions, everyone likes it. Two months later, in production, it just as confidently tells a client an outdated contract term, cites a revoked regulation, and merges two different products into one answer. Formally the system “works”: it always answers something. That is exactly the problem.

RAG (retrieval-augmented generation) was conceived as a way to remove fabrication: the model answers not from memory but from a retrieved fragment. But RAG by itself does not remove hallucinations; it moves them from the model level to the retrieval level. If the wrong fragment is found, or the fragment is stale, the model will neatly and convincingly retell the wrong thing.

A confident error costs more than an honest “I don’t know”.

Hypothesis: RAG quality is set by context engineering, not the model

It is commonly thought that answer quality depends on how strong the model behind RAG is. In practice the decisive factor is which exact fragment, and in what volume, made it into the context. A strong model on a wrong fragment gives a nicely formatted error; a weak model on a precise fragment gives a useful answer.

data

Answer accuracy: model alone vs the same model with RAG

What decides accuracy is retrieval, not model size: on the same model, adding RAG multiplies accuracy. Answer quality is set by which context reached the model.

Source: Exploring RAG Solutions to Reduce Hallucinations in LLMs, IEEE, 2024 https://ieeexplore.ieee.org/document/11014810/

It follows that building a RAG system is mostly building a context pipeline, not picking a model.

Problem: the demo runs on static data, production on change and volume

The demo has three conveniences absent in production. Few documents, so almost any search returns something relevant. Documents don’t change, so the answer doesn’t go stale. Questions are asked “correctly”, close to the source text.

In production it is all different. Thousands of sources, including contradicting versions. Documents change daily: a contract re-signed, a regulation revoked, a price updated. Questions are asked in the user’s own words, far from the document’s wording. Under these conditions the naive “vector → model” scheme starts confidently erring, and worst of all the error is invisible: the system doesn’t crash, it just sometimes lies.

Why the usual approaches don’t work

“Load documents into a vector DB” is not a knowledge system but its simplest piece. Vector search finds what is semantically similar, but it does not distinguish fresh from stale, exact from approximate, or permitted for this user from forbidden. It will return a similar old regulation with the same confidence as the active one.

Increasing model size doesn’t solve it: the model doesn’t know the given fragment is stale or irrelevant. Stuffing in “all retrieved fragments just in case” only makes it worse — context length (and cost) grows and the model starts mixing sources. The root is that the scheme lacks three things: freshness (when the fragment was updated), ranking (which of the similar ones actually answers), and grounding checks (does the answer rely on the provided text).

Engineering model: a pipeline, not “DB plus model”

A working RAG system is a pipeline with explicit layers.

Source normalization. Different formats are brought to one shape; each fragment has metadata: source, version, date, access rights. Without it you can neither cut off the stale nor restrict output by user.
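A minimal sketch of what such a normalized fragment might look like. The field names, the `superseded` flag and the role-based access model are assumptions made for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fragment:
    text: str
    source: str                    # document the fragment was cut from
    version: int                   # version of that document
    updated: date                  # date this version took effect
    allowed_roles: frozenset[str]  # roles permitted to see the fragment
    superseded: bool = False       # hypothetical flag: set when a newer version or a revocation arrives


def visible_fragments(fragments, user_roles):
    """Keep only current fragments that the user is allowed to see."""
    roles = set(user_roles)
    return [
        f for f in fragments
        if not f.superseded and roles & f.allowed_roles
    ]


fragments = [
    Fragment("Payment is due in 30 days.", "contract-17", 1, date(2023, 1, 10),
             frozenset({"sales", "legal"}), superseded=True),
    Fragment("Payment is due in 14 days.", "contract-17", 2, date(2024, 6, 1),
             frozenset({"sales", "legal"})),
    Fragment("Internal margin is 40%.", "pricing-memo", 1, date(2024, 5, 2),
             frozenset({"finance"})),
]

for f in visible_fragments(fragments, {"sales"}):
    print(f.source, f.version, f.text)
```

The filter runs before any similarity search, so a stale or forbidden fragment never becomes a candidate in the first place.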

Event-based indexing, not scheduled. The index updates when a source changes (a re-signed contract, a revoked regulation), not in a nightly “batch”. A stale answer is almost always an index lagging behind reality.
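As a rough illustration of the event-based approach, the sketch below updates a toy in-memory index one source at a time. `ChangeEvent`, `Index` and the paragraph splitter are hypothetical names; a real system would receive such events from a document store or message queue and write to an actual search index.

```python
from dataclasses import dataclass


@dataclass
class ChangeEvent:
    source: str      # identifier of the changed document
    kind: str        # "updated" or "revoked"
    text: str = ""   # new content for "updated" events


def split_into_fragments(text):
    """Naive splitter: one fragment per non-empty paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


class Index:
    """Toy in-memory index: source id -> list of fragments."""

    def __init__(self):
        self.fragments = {}

    def handle(self, event):
        if event.kind == "revoked":
            # A revoked document stops being retrievable as soon as the event arrives.
            self.fragments.pop(event.source, None)
        elif event.kind == "updated":
            # Replace only this source's fragments; the rest of the index is untouched.
            self.fragments[event.source] = split_into_fragments(event.text)
        else:
            raise ValueError(f"unknown event kind: {event.kind}")


index = Index()
index.handle(ChangeEvent("regulation-5", "updated", "Old rule."))
index.handle(ChangeEvent("contract-17", "updated", "Term: 12 months.\n\nPayment: 14 days."))
index.handle(ChangeEvent("regulation-5", "revoked"))
print(index.fragments)
```

The point of the design is that the cost of an update is proportional to what changed, and the lag between a change and the index is the event delivery time rather than the wait until the next batch.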

Hybrid search with reranking. Vector search yields semantic candidates, lexical search yields exact matches on terms and numbers; then a separate reranking step selects, from the candidates, those that actually answer the question. This is the step absent from the naive scheme and the one that most affects whether the system lies.
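A sketch of the two stages described above, under stated assumptions. The article does not name a fusion method; reciprocal rank fusion is used here as one common way to merge a vector ranking with a lexical one. The two hit lists are assumed outputs of real searches, and `overlap_score` is a toy stand-in for a cross-encoder, not a real reranking model.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def rerank(query, candidates, texts, score, top_n=2):
    """Score each (query, text) pair and keep the top_n candidates."""
    ranked = sorted(candidates, key=lambda d: score(query, texts[d]), reverse=True)
    return ranked[:top_n]


def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: fraction of query words found in the text."""
    q = set(query.lower().split())
    t = set(text.lower().replace(".", "").split())
    return len(q & t) / len(q)


texts = {
    "a": "Payment is due in 14 days.",
    "b": "Delivery takes 14 days.",
    "c": "Payment terms were revised in 2024.",
}
vector_hits = ["c", "a", "b"]   # assumed output of a vector search
lexical_hits = ["b", "a"]       # assumed output of a lexical search

candidates = reciprocal_rank_fusion([vector_hits, lexical_hits])
best = rerank("payment due days", candidates, texts, overlap_score, top_n=1)
print(candidates, best)
```

In this toy run the fused list puts "b" first because it appears high in both rankings, and only the reranking step, which looks at the query and the text together, moves "a" to the top.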

data

What a reranking step (cross-encoder) buys you

+25–48%: retrieval-quality gain from reranking (depending on baseline and domain)

+4 nDCG: cross-encoder advantage over a strong bi-encoder, BEIR average

Reranking is the layer missing from the naive “vector → model” scheme, and the one that most shifts results from “wrong” to “right”.

Source: BEIR benchmark; cross-encoder reranking studies, 2022–2024 https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/65b9eea6e1cc6bb9f0cd2a47751a186f-Paper-round2.pdf

Context budget. The model is fed not “everything found” but a reranked minimum within a set token limit. This is both about quality (less source mixing) and cost (context doesn’t silently grow).
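The budget rule can be sketched as a greedy cut over the reranked list. The word-count tokenizer is an assumption made to keep the example self-contained; a real system would count tokens with the model's own tokenizer, and the limit of 15 is an illustrative number.

```python
def count_tokens(text):
    """Rough token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def fit_to_budget(ranked_fragments, max_tokens):
    """Take fragments in reranked order until the next one would exceed the budget."""
    selected, used = [], 0
    for fragment in ranked_fragments:
        cost = count_tokens(fragment)
        if used + cost > max_tokens:
            break  # stop here: lower-ranked fragments are not pulled in to fill the gap
        selected.append(fragment)
        used += cost
    return selected, used


ranked = [
    "Payment is due in 14 days.",
    "Late payment incurs a 2% fee per month.",
    "Delivery takes 14 days.",
]
selected, used = fit_to_budget(ranked, max_tokens=15)
print(len(selected), used)
```

Stopping at the first fragment that does not fit, rather than skipping it and taking a smaller lower-ranked one, is a deliberate choice in this sketch: it keeps weakly relevant text out of the context even when there is room for it.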

Grounding evaluation. For a sample of the traffic, an automatic check verifies that the answer relies on the provided fragments and contains no claims “out of thin air”. This turns “seems to work” into a measurable quantity.
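One crude way to put a number on grounding is shown below. It is a lexical proxy only, written under the assumption that a sentence counts as supported when most of its words occur in a single provided fragment; the 0.7 threshold is arbitrary. Production checks typically use an entailment model or an LLM judge, which this sketch does not attempt to reproduce.

```python
import re


def words(text):
    """Lowercased word set; digits and percent signs are kept as part of words."""
    return set(re.findall(r"[a-z0-9%]+", text.lower()))


def grounded_share(answer, fragments, threshold=0.7):
    """Share of answer sentences whose words mostly appear in some provided fragment."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    fragment_words = [words(f) for f in fragments]
    supported = 0
    for sentence in sentences:
        w = words(sentence)
        if not w:
            continue  # a sentence without words counts as unsupported
        best = max((len(w & fw) / len(w) for fw in fragment_words), default=0.0)
        if best >= threshold:
            supported += 1
    return supported / len(sentences)


fragments = ["Payment is due in 14 days."]
answer = "Payment is due in 14 days. A 5% discount applies for early payment."
print(grounded_share(answer, fragments))
```

Here the second sentence has no support in the provided fragment, so half of the answer is flagged as ungrounded. Tracked over sampled traffic, a score like this becomes a trend to watch rather than a verdict on any single answer.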

Practical takeaway for business

A demo on ten documents cannot be taken as proof. In the pilot, ask three things: how the system knows a document is stale; what happens when what’s found is insufficient (an honest “I don’t know” or an answer anyway); how the share of grounded answers is measured. The answers predict production behaviour better than any demo.

The price of an error defines the architecture, not the other way round. Where a wrong answer is a legal or financial risk, the absence of an answer is cheaper than a confident error; the system must be able to stay silent. This is a design decision made before development.
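The ability to stay silent can be sketched as a gate in front of generation. The relevance scores are assumed to come from the reranking step, `generate` is a placeholder for the model call, and the 0.5 threshold is illustrative; where it sits is exactly the design decision described above.

```python
NO_ANSWER = "I don't know: the knowledge base has no reliable source for this."


def answer_or_abstain(scored_fragments, generate, min_score=0.5):
    """scored_fragments: list of (fragment_text, relevance_score) pairs."""
    strong = [text for text, score in scored_fragments if score >= min_score]
    if not strong:
        # Nothing relevant enough was retrieved: abstain instead of guessing.
        return NO_ANSWER
    return generate(strong)


def fake_generate(fragments):
    """Placeholder for a model call: simply echoes the fragments it was given."""
    return " ".join(fragments)


print(answer_or_abstain([("Payment is due in 14 days.", 0.9)], fake_generate))
print(answer_or_abstain([("Delivery takes 14 days.", 0.2)], fake_generate))
```

Raising `min_score` trades coverage for safety: more questions get an honest refusal, fewer get a confident error.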

RAG cost is predictable if context budgets and event-based indexing are designed in. It becomes unpredictable where “just in case” everything found is stuffed in and everything is reindexed indiscriminately.


Open questions

How to measure answer “truthfulness” in production without human labeling is a problem without a mature general solution; automated grounding scores approximate but do not replace spot human checks. Where the line is between “the system honestly doesn’t know” and “the system must find it” is a business decision, not a model one, and it changes process to process. How fresh the index must be is a trade-off between reindexing cost and the price of a stale answer; there is no universal answer, only a calculation for a specific change flow.
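The freshness trade-off can at least be written down as arithmetic. The model below is a deliberate simplification and every number in it is invented for illustration: it assumes the index is refreshed every `period_hours`, that a changed document stays stale for half a period on average, and that each stale answer has a fixed price.

```python
def daily_cost(period_hours, changes_per_day, queries_per_doc_hour,
               stale_answer_price, reindex_run_price):
    """Expected daily cost of refreshing the index every period_hours."""
    runs_per_day = 24 / period_hours
    avg_lag_hours = period_hours / 2  # a change waits half a period on average
    stale_answers = changes_per_day * avg_lag_hours * queries_per_doc_hour
    return runs_per_day * reindex_run_price + stale_answers * stale_answer_price


# Hypothetical change flow: 20 changed documents a day, each queried once every
# two hours, a stale answer priced at 10 units, a reindex run at 50 units.
for period in (24, 6, 1):
    print(period, daily_cost(period, 20, 0.5, 10, 50))
```

With these made-up inputs the nightly refresh and the hourly refresh cost the same, and the six-hour refresh is cheapest; with a different change flow or error price the optimum moves, which is why the calculation has to be done per process.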

If you have a knowledge base people answer from by hand, and the price of an error in the answer is high, that is a candidate for RAG built as a system, not as a demo. The first things to go through are the sources, their change frequency, and where the “honestly don’t know” boundary lies.