engineering notes

Context entropy and the degradation of answer quality

How noise accumulating in context lowers an AI system's answer quality, and which engineering techniques hold it back.

In brief for executives. An AI system’s answer quality degrades not abruptly but imperceptibly — as noise accumulates in the context. This is an operational risk, not a model bug: the system keeps answering confidently, just increasingly inaccurately. The degradation cannot be “fixed by a model update” — it is prevented by engineering, by managing context. For the business this means: quality predictability is a consequence of architecture, not luck.

AI systems have an unpleasant property: they rarely break visibly. More often they slowly “drift” — answers become ever less accurate but just as confident. Behind this is noise accumulating in the context; call it context entropy.

Quality doesn’t drop at once — it quietly drifts with the context.

Hypothesis: noise in context grows over time

The longer a dialogue, session or process lives, the more irrelevant material enters the context: stale chunks of history, extra documents, intermediate reasoning. The useful signal is diluted. The model still answers, but the signal-to-noise ratio falls, and with it accuracy.
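The dilution can be illustrated with simple arithmetic. The sketch below is a toy model, not a measurement: it assumes a fixed amount of relevant material and a constant amount of irrelevant material added per turn, and the function name and all numbers are hypothetical. The share it computes is only a rough proxy for signal-to-noise, not for accuracy itself.

```python
def signal_share(relevant_tokens: int, noise_per_turn: int, turns: int) -> float:
    """Share of context tokens that are still relevant after `turns` turns.

    Toy assumption: the relevant part stays constant while every turn
    adds the same amount of irrelevant material.
    """
    total = relevant_tokens + noise_per_turn * turns
    return relevant_tokens / total


if __name__ == "__main__":
    # Hypothetical numbers: 2000 relevant tokens, 800 noise tokens per turn.
    for turns in (0, 5, 20, 50):
        print(turns, round(signal_share(2000, 800, turns), 2))
```

With these made-up inputs the relevant share falls from 1.0 at the start to about 0.33 after 5 turns and about 0.05 after 50, although nothing in the system has "broken".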

Problem: the degradation is invisible

Unlike a service outage, growing context entropy raises no error. There is no alert “quality dropped 12%”. The user sees slightly less accurate answers, writes it off as “AI sometimes errs”, while the system has drifted away from working quality. The problem is noticed late — by complaints or by an incident.

Answer accuracy by position of the needed fact in a long context

Models use long context unevenly: a fact placed in the middle is used worse than one at the beginning or the end. “More context” without managed delivery lowers accuracy rather than raising it. Values are illustrative; the profile is from the study.

Source: Lost in the Middle: How Language Models Use Long Contexts (Liu et al.), 2023 https://arxiv.org/abs/2307.03172

The mechanism is clear: the longer and noisier the context, the worse the model uses what’s in the middle. Accumulating context literally works against accuracy.

Why the usual approaches don’t work

“Add more context for reliability” accelerates degradation: more input means more noise and a stronger middle effect.

“Take a stronger model” doesn’t help either: a stronger model has no better way of knowing that half of the context it was given is irrelevant.

Model context windows grow roughly 30× per year

Context window, in tokens: 4K in early 2023 → 1M+ in 2025. Growth rate of context length since mid-2023: ≈30× per year.

The window grows faster than the ability to use it: you can fit almost anything, but accuracy depends on what is put there and how. Window size is no substitute for context engineering.

Source: Epoch AI, context length analysis https://epoch.ai/data-insights/context-windows

The window grows by orders of magnitude, but that only increases how much noise can be accumulated — not the model’s ability to ignore it.

“Restart the session manually when we notice” doesn’t work as a strategy: the degradation is precisely what isn’t noticed in time — that is its nature.

Engineering model: how to hold back entropy

Active context cleanup. Context is not accumulated but reassembled per step: the irrelevant is dropped, not “kept just in case”.

Compression without meaning loss. A long history is folded into a compact state (what matters, what’s decided), not dragged verbatim.
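One way to picture the compact state is as a small structure that each turn is folded into, after which the verbatim text of the turn is no longer carried forward. The sketch below is a minimal illustration under assumptions: `SessionState`, `fold`, and the turn keys (`goal`, `decision`, `opens`, `resolves`) are hypothetical names, and the step that extracts these fields from raw dialogue (possibly a model call) is assumed to exist and is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Compact state: what matters and what has been decided."""
    goal: str = ""
    decisions: list[str] = field(default_factory=list)
    open_items: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Text that goes into the context instead of the full history."""
        lines = [f"Goal: {self.goal}"]
        lines += [f"Decided: {d}" for d in self.decisions]
        lines += [f"Open: {o}" for o in self.open_items]
        return "\n".join(lines)


def fold(state: SessionState, turn: dict) -> SessionState:
    """Fold one already-extracted turn into the state.

    Any other keys in `turn` (for example the raw text) are ignored,
    so the verbatim history does not accumulate.
    """
    if turn.get("goal"):
        state.goal = turn["goal"]
    if turn.get("decision"):
        state.decisions.append(turn["decision"])
    for item in turn.get("opens", []):
        if item not in state.open_items:
            state.open_items.append(item)
    for item in turn.get("resolves", []):
        if item in state.open_items:
            state.open_items.remove(item)
    return state
```

The size of the rendered state then depends on the number of decisions and open items, not on the length of the conversation that produced them.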

Event-based reindexing. Sources are reindexed when they change, so that a stale version, which is semantically similar to the current one and therefore easy to retrieve by mistake, doesn’t enter the context.
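A minimal sketch of reindexing on change, assuming an in-memory dictionary stands in for the real index (in practice this would likely be a search or vector index with re-embedding). `SourceIndex`, `on_change`, and `on_delete` are hypothetical names. The point it illustrates is that a change event overwrites the old version, so only the current text remains retrievable.

```python
import hashlib


class SourceIndex:
    """In-memory stand-in for a retrieval index, keyed by document id."""

    def __init__(self):
        self._docs = {}  # doc_id -> (content_hash, text)

    def on_change(self, doc_id: str, text: str) -> bool:
        """Handle a 'source changed' event. Returns True if the index was updated."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        current = self._docs.get(doc_id)
        if current is not None and current[0] == digest:
            return False  # content is unchanged, nothing to reindex
        self._docs[doc_id] = (digest, text)  # overwrite: the old version is gone
        return True

    def on_delete(self, doc_id: str) -> None:
        """Handle a 'source deleted' event."""
        self._docs.pop(doc_id, None)

    def texts(self) -> list:
        """Everything that retrieval could currently return."""
        return [text for _, text in self._docs.values()]
```

Comparing content hashes keeps repeated events for unchanged documents cheap, which matters if the event source is noisy.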

A context budget per step. A hard ceiling stops length from silently growing — and limits noise accumulation.
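Active cleanup and the per-step budget can be sketched together as one assembly function that rebuilds the context from scratch on every step. This is an illustration under assumptions: `Item`, `assemble_context`, the relevance threshold, and the word-count tokenizer are all hypothetical stand-ins, and the relevance scores are assumed to come from some retriever or scorer that is not shown.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    relevance: float  # assumed to come from a retriever or scorer, 0..1


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def assemble_context(items, budget: int, min_relevance: float = 0.5):
    """Rebuild the context for one step.

    1. Cleanup: items below `min_relevance` are dropped, not kept just in case.
    2. Budget: the remaining items are added from most to least relevant,
       and anything that would exceed the hard ceiling is skipped.
    Returns the kept items and the number of tokens used.
    """
    kept, used = [], 0
    candidates = [i for i in items if i.relevance >= min_relevance]
    for item in sorted(candidates, key=lambda i: i.relevance, reverse=True):
        cost = count_tokens(item.text)
        if used + cost > budget:
            continue  # hard ceiling: skip what does not fit
        kept.append(item)
        used += cost
    return kept, used
```

Because the function starts from an empty list each time, context length cannot grow silently between steps; it can only be what the current step justifies within the budget.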

Quality measurement in the flow. For a sample of production traffic, answer grounding is evaluated automatically. This turns invisible degradation into an observable metric you can react to before complaints arrive.
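A minimal sketch of such a measurement, assuming a sampled rolling average with a threshold. The grounding score used here (share of answer words that also occur in the context) is a deliberately crude placeholder for whatever evaluator a real system uses; `GroundingMonitor`, `grounding_score`, and the default values are hypothetical.

```python
import random
from collections import deque


def grounding_score(answer: str, context: str) -> float:
    """Crude proxy: share of answer words that also occur in the context."""
    answer_words = answer.lower().split()
    context_words = set(context.lower().split())
    if not answer_words:
        return 0.0
    return sum(w in context_words for w in answer_words) / len(answer_words)


class GroundingMonitor:
    """Scores a sample of traffic and tracks a rolling mean of the scores."""

    def __init__(self, sample_rate=0.1, window=200, threshold=0.7, rng=None):
        self.sample_rate = sample_rate
        self.threshold = threshold
        self.scores = deque(maxlen=window)  # only the most recent scores are kept
        self.rng = rng or random.Random()

    def observe(self, answer: str, context: str) -> bool:
        """Score this request if it is sampled; return the current degraded flag."""
        if self.rng.random() < self.sample_rate:
            self.scores.append(grounding_score(answer, context))
        return self.degraded()

    def degraded(self) -> bool:
        """True when the rolling mean of sampled scores is below the threshold."""
        if not self.scores:
            return False
        return sum(self.scores) / len(self.scores) < self.threshold
```

The flag returned by `observe` is the kind of signal that could feed an alert, which is what turns "quality dropped" from a complaint into a metric.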

Practical takeaway for business

Quality degradation is a managed risk, not a property of AI. Ask how the system measures quality in production and what it has for context budget and cleanup. If there’s no “we measure this” answer — quality is not controlled, and you’ll learn of it by complaints.

Design quality measurement in from the start. Cost is a few days of work if done in advance; weeks of post-incident analysis if not. The difference is not technology but whether quality was treated as an observable quantity.


Open questions

There is no mature standard for measuring “context entropy” directly rather than through proxy signals. Where the boundary lies between useful memory and noise depends on the process and is settled by measurement. How much more robust newer models are to noisy context is also open: there is improvement, but it doesn’t remove the need for context management.

If your system’s answers have “drifted” but there are no explicit errors, it is almost certainly context entropy. We’ll look at how to measure quality and where noise accumulates.