
RAG system architecture: sources, indexing, reranking

How a RAG (retrieval-augmented generation) system is built: sources, indexing, hybrid search, reranking, and delivery of the minimally sufficient context.

In brief for executives. On a single slide, a RAG system is usually drawn as one arrow: “knowledge base → model → answer”. That picture is the source of most production problems. A real RAG architecture is a pipeline of several layers, and those layers, not model size, set the system’s cost and reliability. You need to understand how it is built not in order to write code, but to ask the contractor the right questions and to know what you are paying for.

RAG (retrieval-augmented generation) is an approach in which the model answers not “from memory” but from fragments of your documents retrieved for the specific question. The idea is simple, and because of that simplicity the scheme is usually drawn as one arrow. In a demo nothing breaks on that arrow; in production almost everything does. Let’s go through what actually happens between the “base” and the “answer”, layer by layer.

RAG is not one arrow from base to model — it is a context pipeline.

Hypothesis: RAG is context infrastructure, not “search plus LLM”

It helps to swap the mental picture immediately. RAG is not “a model that was allowed to search a base”. It is infrastructure that, for each question, assembles the minimally sufficient, fresh and user-permitted context and only then hands it to the model. The model here is the last and most replaceable layer. Everything that determines whether the system lies sits before it.

Problem: the scheme is simplified to one arrow

The simplified “documents → vector → model” scheme drops exactly the layers responsible for reliability: normalizing sources into one shape, updating the index when a document changes, picking from similar candidates the one that actually answers the question, and limiting the volume of context. Each omission looks like a detail; together they are the difference between a demo and a system.

Why the usual approaches don’t work

The naive architecture rests on three implicit assumptions, and each one fails in production.

“Similar in meaning = correct”. Vector search finds what is semantically close, but an active regulation and a revoked one are semantically almost identical.

“The index matches reality”. If the index updates on a schedule, there is a window between a document changing and the change reaching the index, and in that window the system answers from the stale version.

“More context is better”. Feeding the model every fragment found increases request length (and hence cost) and makes the model mix sources. Quality falls where growth was expected.

Engineering model: the layers of the architecture

Describe the pipeline as it really is.

Source and normalization layer. Different formats and systems are brought to uniform fragments with metadata: where the fragment came from, which version it belongs to, when it changed, and who may access it. This is the foundation: without metadata, neither filtering out stale versions nor access control is possible.
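To make the metadata concrete, here is a minimal sketch of what a normalized fragment might look like. The `Fragment` schema, its field names, the `revoked` flag and the `is_usable` helper are all hypothetical, chosen only to mirror the four metadata items named above; a real system would map them onto its own sources.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Fragment:
    """One normalized chunk of a source document (hypothetical schema)."""
    text: str
    source: str                     # where the fragment came from
    version: str                    # which document version it belongs to
    changed_at: datetime            # when the source document last changed
    allowed_groups: frozenset[str]  # who may access it
    revoked: bool = False           # hypothetical flag for superseded documents


def is_usable(fragment: Fragment, user_groups: set[str]) -> bool:
    """A fragment is usable only if it is not revoked and the user may read it."""
    return not fragment.revoked and bool(fragment.allowed_groups & user_groups)


clause = Fragment(
    text="Clause 4.2: travel expenses are approved by the department head.",
    source="policies/travel",
    version="v3",
    changed_at=datetime(2024, 5, 1, tzinfo=timezone.utc),
    allowed_groups=frozenset({"finance"}),
)
```

With this shape, both the stale cut-off and the access check become simple filters on fields that every fragment carries, regardless of the original format.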

Event-based indexing layer. A source change emits a reindex event for that exact document. The index lags reality by seconds, not a day.
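The idea of reindexing exactly one document per change event can be sketched as follows. This is an illustration under stated assumptions, not a production design: `ChangeEvent`, `InMemoryIndex`, the event kinds and the word-based `split_into_chunks` are hypothetical stand-ins for a real event bus, vector store and chunker.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    doc_id: str
    kind: str      # "upsert" or "delete" (hypothetical event kinds)
    text: str = ""


def split_into_chunks(text: str, max_words: int = 50) -> list[str]:
    """Naive chunker: fixed-size windows of words (stand-in for a real one)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


@dataclass
class InMemoryIndex:
    chunks: dict[str, list[str]] = field(default_factory=dict)

    def handle(self, event: ChangeEvent) -> None:
        """Reindex exactly the document named in the event, nothing else."""
        if event.kind == "delete":
            self.chunks.pop(event.doc_id, None)
        elif event.kind == "upsert":
            # Replacing the whole entry also drops chunks of the previous version.
            self.chunks[event.doc_id] = split_into_chunks(event.text)
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
```

The point of the design is that the cost of an update is proportional to the changed document, not to the size of the base, which is what makes a lag of seconds realistic.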

Search layer: hybrid. Semantic search (by meaning) and lexical search (exact terms, numbers, SKUs) run in parallel, and their results are merged. This covers both “asked in their own words” and “need clause 4.2”.
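The text does not say how the two result lists are merged. One common option is reciprocal rank fusion (RRF), shown here as an assumption rather than as the method this architecture prescribes. Each document scores 1 / (k + rank) in every list where it appears, and the scores are summed; k = 60 is the constant commonly used with RRF. The document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one ranked list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to keep output stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = ["d1", "d2", "d3"]  # hypothetical output of vector search
lexical_hits = ["d3", "d1", "d4"]   # hypothetical output of keyword search
merged = reciprocal_rank_fusion([semantic_hits, lexical_hits])
```

RRF uses only ranks, so it does not need the two searches to produce comparable scores. Documents found by both searches rise to the top, while documents found by only one still stay in the candidate list.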

Reranking layer. A separate ranker model selects, from dozens of candidates, the few that actually answer the question, accounting for freshness and rights. This is the layer absent from the simplified scheme and the one that most affects accuracy.
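A minimal sketch of this layer, assuming candidates already carry version and access metadata. The `Candidate` type and the `rerank` function are hypothetical, and `overlap_score` is a toy word-overlap scorer standing in for the separate ranker model, which in practice would be a trained model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    version: int
    text: str
    allowed_groups: frozenset[str]


def overlap_score(question: str, text: str) -> float:
    """Toy relevance score: shared lowercase words (stand-in for a ranker model)."""
    return float(len(set(question.lower().split()) & set(text.lower().split())))


def rerank(
    question: str,
    candidates: list[Candidate],
    score: Callable[[str, str], float],
    user_groups: set[str],
    top_n: int = 3,
) -> list[Candidate]:
    # Freshness: find the highest version seen for each document.
    latest: dict[str, int] = {}
    for c in candidates:
        latest[c.doc_id] = max(latest.get(c.doc_id, c.version), c.version)
    # Keep only candidates that are both current and readable by this user.
    usable = [
        c for c in candidates
        if c.version == latest[c.doc_id] and c.allowed_groups & user_groups
    ]
    ranked = sorted(usable, key=lambda c: score(question, c.text), reverse=True)
    return ranked[:top_n]
```

Freshness and rights are applied as hard filters before scoring, so a superseded or forbidden fragment cannot win on similarity alone.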


Retrieval quality by pipeline layer (nDCG@10, BEIR, illustrative)

Each pipeline layer measurably adds quality. The “one arrow” scheme has none of these layers, which is why it breaks in production. Values are illustrative, within BEIR ranges.

Source: BEIR: A Heterogeneous Benchmark for Zero-shot IR, NeurIPS, 2021 https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/65b9eea6e1cc6bb9f0cd2a47751a186f-Paper-round2.pdf
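For readers unfamiliar with the metric, nDCG@10 can be computed as in the sketch below. It uses the linear-gain form of DCG (some tools use 2^rel - 1 instead), and the function names and sample judgments are illustrative assumptions, not BEIR's own code.

```python
import math


def dcg(gains: list[int], k: int) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids: list[str], judgments: dict[str, int], k: int = 10) -> float:
    """nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking.

    `judgments` maps document id to graded relevance; unjudged ids count as 0.
    """
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = dcg(sorted(judgments.values(), reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


judgments = {"a": 1, "b": 1}                         # hypothetical labels for one question
perfect = ndcg_at_k(["a", "b", "c"], judgments)      # relevant documents ranked first
degraded = ndcg_at_k(["c", "a", "b"], judgments)     # an irrelevant document ranked first
```

The metric rewards putting relevant fragments near the top, which is what a reranking layer changes.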

Context-budget layer. Only the top reranked fragments that fit within a per-request token limit go to the model. The context does not grow silently, and sources do not get mixed.
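A context budget can be as simple as greedy packing in rank order. In this sketch `pack_context` and `approx_tokens` are hypothetical names, and counting whitespace-separated words is only a rough stand-in for the tokenizer of whichever model is in use.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    """Rough token estimate by word count (assumption; use the model's tokenizer)."""
    return len(text.split())


def pack_context(
    ranked_fragments: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> list[str]:
    """Take fragments in rank order until the next one would exceed the budget."""
    selected: list[str] = []
    used = 0
    for fragment in ranked_fragments:
        cost = count_tokens(fragment)
        if used + cost > max_tokens:
            # Stop rather than skip ahead, so a lower-ranked fragment never
            # replaces a higher-ranked one that did not fit.
            break
        selected.append(fragment)
        used += cost
    return selected
```

Stopping at the first fragment that does not fit is one possible policy. Skipping it and trying smaller fragments would fill the budget more tightly but would let weaker fragments into the context.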

Generation and evaluation layer. The model forms an answer from the provided context; in parallel, for a share of the traffic, the answer’s grounding in that context is evaluated. The model is chosen per task and is replaceable: the layer’s contract does not change when the model changes.
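The “stable contract” can be illustrated with a structural interface. Everything here is a hypothetical sketch: `Generator` is an assumed contract, and `EchoModel` and `UpperModel` are trivial stand-ins for two different model backends.

```python
from typing import Protocol


class Generator(Protocol):
    """The layer's contract: a question and its context in, an answer out."""

    def generate(self, question: str, context: list[str]) -> str: ...


class EchoModel:
    """Stand-in backend: answers with the first context fragment."""

    def generate(self, question: str, context: list[str]) -> str:
        return context[0] if context else "No supporting context found."


class UpperModel:
    """A second stand-in backend with different behavior, same contract."""

    def generate(self, question: str, context: list[str]) -> str:
        return context[0].upper() if context else "NO SUPPORTING CONTEXT FOUND."


def answer(model: Generator, question: str, context: list[str]) -> str:
    # The pipeline depends only on the contract, never on a concrete model.
    return model.generate(question, context)
```

Swapping `EchoModel` for `UpperModel` changes one argument at the call site and nothing else in the pipeline, which is the property the layer is meant to guarantee.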

Inference price at GPT-3.5 level (per 1M tokens): $20.00 in November 2022, $0.07 in October 2024. That is a price drop of more than 280× in just under two years.

The model’s own price is collapsing, so system cost is set not by the model but by the architecture around it: context length, number of calls, routing.

Source: Stanford HAI, AI Index Report 2025 https://hai.stanford.edu/ai-index/2025-ai-index-report
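A back-of-the-envelope calculation shows why architecture dominates cost once the per-token price is this low. The $0.07 per 1M tokens comes from the figure above; the request volume, calls per request and tokens per call are made-up numbers for illustration, and the formula ignores any difference between input and output pricing.

```python
def monthly_cost_usd(
    requests: int,
    calls_per_request: int,
    tokens_per_call: int,
    price_per_million_usd: float,
) -> float:
    """Total tokens processed in a month, priced per million tokens."""
    total_tokens = requests * calls_per_request * tokens_per_call
    return total_tokens * price_per_million_usd / 1_000_000


PRICE = 0.07  # USD per 1M tokens, from the figure above

# Hypothetical workload: 100,000 requests per month.
budgeted = monthly_cost_usd(100_000, calls_per_request=1, tokens_per_call=2_000,
                            price_per_million_usd=PRICE)
unbudgeted = monthly_cost_usd(100_000, calls_per_request=3, tokens_per_call=20_000,
                              price_per_million_usd=PRICE)
```

With the same model and the same price, the second configuration costs 30 times more than the first, purely because of context length and the number of calls.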

Practical takeaway for business

The architecture is what you buy, not the model. If a contractor shows a one-arrow scheme, that is a signal: the layers responsible for reliability are either not built or not understood. Ask to be shown where event-based updates, reranking and the context budget are.

Cost and quality live in the same layers. Context length is at once about money and source mixing; event-based indexing is at once about freshness and trust. So “make it cheaper” and “make it more accurate” here are often the same engineering step, not a trade-off.

Swapping the model should not be a project. In a correct architecture the model is a swappable layer behind a stable contract; if changing the model requires rewriting the system, the architecture is wrong, and that is future cost.


Open questions

How many layers are justified for a specific case is a matter of calculation, not dogma: for a rarely changing base of a hundred documents some layers are redundant, for a live corporate flow they are mandatory. How to balance context completeness and cost is an open trade-off, resolved by measurement on real questions, not a general rule. How to measure ranking quality on your data without a labeled gold set is a problem without a ready answer; we build the gold set from historical requests, but that is an approximation.

If you are shown RAG as “one arrow” while the data changes every day, it is worth dissecting the architecture by layers before development starts: the sources, how often they change, and where the access boundaries run.