engineering notes

A model of cognitive noise

An AI agent's quality is defined not by the right answer but by resilience to dialog noise: topic switches, contradictions, emotions, returns to old questions. How to redefine the quality metric.

Executive summary. Dialog assistants are usually measured by whether they answer the question correctly. That metric misleads: almost any modern agent can give the right answer to a clean question; the difference between good and bad agents lies elsewhere. A real dialog is not a chain of clean questions but twenty minutes of noise: the person switches topics, contradicts themselves, misremembers what was said, adds constraints, gets nervous and returns to closed questions. A good agent is the one that doesn’t lose the goal inside that noise. If you measure the agent by answer accuracy, you are paying to measure the wrong thing.

When people evaluate a dialog agent, they ask out of habit: does it answer correctly? The metric seems obvious, and that’s why almost no one notices it doesn’t measure the main thing. Giving the right answer to a cleanly posed question is something practically any system on a good model can do today. If real dialogs consisted of clean questions, the quality problem would be solved. But they don’t. A real dialog is a stream of interference, and that’s exactly where agents break.

A good agent isn’t the one that gives the right answer — it’s the one that doesn’t lose the goal after twenty minutes of chaos.

Hypothesis: agent quality is resilience to noise, not answer accuracy

We claim that the “answer correctness” metric measures the least distinguishing characteristic of an agent. On a clean question a good and a bad agent are nearly indistinguishable — both answer correctly. The difference appears only under noise load: when the user derails the topic, contradicts themselves, refers to things they never said, adds a constraint midway, breaks into emotion and ten minutes later returns to a question presumed closed. An agent’s quality is its resilience to this stream of interference while holding the original goal. So the right metric is not “share of correct answers” but “share of dialogs in which the agent reached the goal despite the noise”.
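A minimal sketch of how the two metrics can diverge, assuming each evaluated dialog has been labeled with per-answer correctness and a single goal-reached flag. The DialogResult record, its field names and the sample data are hypothetical, made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DialogResult:
    # Hypothetical evaluation record for one dialog.
    answers_correct: list[bool]  # one flag per agent answer
    goal_reached: bool           # did the dialog end at the user's goal?


def answer_accuracy(results: list[DialogResult]) -> float:
    """Share of individual answers that were correct."""
    flags = [ok for r in results for ok in r.answers_correct]
    return sum(flags) / len(flags) if flags else 0.0


def goal_reach_rate(results: list[DialogResult]) -> float:
    """Share of dialogs in which the agent reached the goal."""
    if not results:
        return 0.0
    return sum(r.goal_reached for r in results) / len(results)


# Made-up data: every answer in the first dialog is correct,
# yet the dialog never gets to the goal.
results = [
    DialogResult([True, True, True, True], goal_reached=False),
    DialogResult([True, True, False, True], goal_reached=True),
]
print(answer_accuracy(results))  # 0.875
print(goal_reach_rate(results))  # 0.5
```

In this toy data the agent looks strong on answer accuracy while reaching the goal in only half of the dialogs, which is the gap the hypothesis points at.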

Problem: a real dialog is noise, not a sequence of questions

It’s worth naming the noise by type, because teams usually treat each kind as a “user error” rather than the norm to design for.

Topic switch: midway the person asks about something else, then returns, or doesn’t. Contradiction: first “budget doesn’t matter”, five turns later “that’s too expensive”. False memory: “but you said there was a discount”, though the agent never said it. Late constraints: everything’s chosen, and then “oh right, I don’t eat gluten / I’m flying with a dog / lower berth only”. Emotion: irritation, anxiety or haste that change not the facts but the tone and priorities. Return to a closed question: a question answered twenty minutes ago is asked again, as if for the first time.

These aren’t pathologies or malice — they’re the primary, normal form of human dialog. That’s how human thinking works: associative, non-linear and inconsistent. Yet teams design the agent for an idealized “question — answer — question” sequence, then wonder why it falls apart on live traffic. And in tests this fragility is invisible: scenarios are written by people who run the dialog cleanly, without noise, because they subconsciously know the “right” branch.

Why the usual approaches don’t work

The first familiar approach is to improve answer quality: a better model, sharper phrasing, a richer knowledge base. This raises the bar on a clean question and gives almost nothing under noise. The problem isn’t that the agent answers poorly but that it loses the thread: a correct answer to a correct question is useless if the agent has already forgotten what the dialog began for. Answer accuracy and noise resilience are different axes, and the first doesn’t pull the second along.

The second approach is to grow the context window: give the model the whole dialog history and it’ll hold the goal itself. It doesn’t. A long window is memory volume, not the ability to separate signal from interference. The longer and noisier the dialog, the more the original goal blurs among contradictions and digressions, and the model starts reacting to the last message instead of the overall task. Context is a managed resource, not a dump of history: without separating goal from noise, a large window only means the agent takes longer to drown.
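One way to read "context is a managed resource" is to assemble the model input from an explicit goal and established facts plus a short tail of recent turns, instead of the full history. The sketch below assumes that reading; build_context, its parameters and the sample values are hypothetical.

```python
def build_context(goal: str, established: list[str],
                  history: list[str], recent_turns: int = 3) -> str:
    """Assemble model input from task state plus a short tail of the dialog."""
    # Only the last few turns are passed verbatim; the goal and the
    # established facts are restated explicitly so they cannot be buried.
    recent = history[-recent_turns:] if recent_turns > 0 else []
    lines = [f"Goal: {goal}", "Established:"]
    lines += [f"- {fact}" for fact in established]
    lines.append("Recent turns:")
    lines += recent
    return "\n".join(lines)


history = [f"turn {i}" for i in range(1, 21)]  # a 20-turn dialog
context = build_context(
    "book a flight",
    ["dates are fixed", "travelling with a dog"],
    history,
)
print(context)
```

The goal stays on the first line however long the dialog gets, and the size of the input no longer grows with the amount of noise.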

The third approach is to script handling of each noise type as a separate rule: a branch for topic switch, a branch for contradiction, a branch for return. This is the same linear script that breaks against real combinatorics: noise types overlap (emotion plus a new constraint plus a false memory at once), and scripting all combinations in advance is impossible. Each new rule adds fragility, not resilience.
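The combinatorics can be made concrete with a quick count. The sketch below takes the six noise types named earlier (the identifiers are illustrative labels) and counts only unordered overlaps within a single turn; ordering across turns and the dialog stage would multiply the number further.

```python
from itertools import combinations

# Illustrative labels for the six noise types described in the text.
NOISE_TYPES = [
    "topic_switch",
    "contradiction",
    "false_memory",
    "late_constraint",
    "emotion",
    "return_to_closed",
]


def overlapping_combinations(types: list[str]) -> list[tuple[str, ...]]:
    """All sets of two or more noise types that can co-occur in one turn."""
    return [
        combo
        for size in range(2, len(types) + 1)
        for combo in combinations(types, size)
    ]


combos = overlapping_combinations(NOISE_TYPES)
print(len(NOISE_TYPES))  # 6 rules cover the single-type cases
print(len(combos))       # 57 further overlaps those rules do not cover
```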

Engineering model: holding the goal as a separate function

The working model factors out goal-holding as a separate agent function, not reducible to answer quality. The loop rests on three things.

First, an explicit, durable model of the dialog goal. The agent holds not just the last message but a continuously updated picture: what this conversation is for, what stage it has reached, what is already established. Noise updates this picture but doesn’t erase it. This is the agent’s primary working resource: not a message history but a structured task state that survives digressions.
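A structured task state of this kind might look like the sketch below. The GoalState class, its fields and the travel example are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass
class GoalState:
    # Hypothetical task state kept alongside the raw message history.
    goal: str                                  # what the conversation is for
    stage: str = "start"                       # where the dialog currently is
    established: dict[str, str] = field(default_factory=dict)
    digressions: list[str] = field(default_factory=list)

    def establish(self, key: str, value: str) -> None:
        """Record a fact that moves the task forward."""
        self.established[key] = value

    def note_digression(self, text: str) -> None:
        """Remember a side topic without touching goal, stage or facts."""
        self.digressions.append(text)


state = GoalState(goal="book a train ticket")
state.establish("berth", "lower")
state.stage = "choosing_train"
state.note_digression("asked about luggage rules")

print(state.goal, state.stage, state.established)
```

The digression is recorded, but the goal, the stage and the established facts are exactly what they were before it.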

Second, separating signal from interference. The agent classifies an incoming message not by content but by its role in the dialog: progress toward the goal, a digression, a contradiction of something said earlier, a return to a closed question, or emotion. It doesn’t silently file a contradiction as a new fact but notices and reconciles it; it recognizes a return to a closed question and doesn’t restart the process; it treats emotion as a tone signal, not a change of task. Without this classification, any noise moves the agent as much as the signal does.
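The handling per role can be sketched as a small dispatch. This sketch assumes the role has already been assigned by some classifier, which is out of scope here; the Role enum, apply_message and the action strings are hypothetical names.

```python
from enum import Enum, auto
from typing import Optional


class Role(Enum):
    # Role of a message in the dialog, not its content.
    PROGRESS = auto()
    DIGRESSION = auto()
    CONTRADICTION = auto()
    RETURN_TO_CLOSED = auto()
    EMOTION = auto()


def apply_message(established: dict[str, str], role: Role,
                  key: Optional[str] = None,
                  value: Optional[str] = None) -> str:
    """Return the agent's next action; only progress changes the state."""
    if role is Role.PROGRESS and key is not None and value is not None:
        established[key] = value
        return "continue"
    if role is Role.CONTRADICTION:
        # Do not overwrite silently: ask which version holds.
        return f"reconcile:{key}"
    if role is Role.RETURN_TO_CLOSED:
        # Repeat the known answer instead of restarting the process.
        return f"repeat_answer:{key}"
    if role is Role.EMOTION:
        return "adjust_tone"
    return "park_and_return"  # digression, or progress with nothing to record


established = {"budget": "doesn't matter"}
action = apply_message(established, Role.CONTRADICTION, "budget", "too expensive")
print(action)       # reconcile:budget
print(established)  # {'budget': "doesn't matter"}
```

The contradiction produces a reconcile action and leaves the recorded fact untouched until the user confirms which version holds.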

Third, controlled return to the goal. After any digression the agent gently steers the dialog back to the original task without losing anything valuable that surfaced along the way: “back to the search, but I’ve noted your dates are firmer now”. That is visible resilience: the user makes noise, the agent holds the course. Technically this is closer to an event-driven model, where the agent reacts to the type of event instead of pushing every message through one pipeline.

Crucially: this model doesn’t try to suppress noise or “train” the user to behave cleanly. It accepts noise as normal input and designs resilience to it — exactly as a reliable system is designed for unreliable data, not against it.

Practical takeaway for the business

Redefine the quality metric. If you evaluate the agent by the share of correct answers, you measure a characteristic on which good and bad agents barely differ, and you don’t measure the one the result depends on. The right metric is the share of dialogs brought to the goal under noise: with topic switches, contradictions, returns and emotions. That’s the proxy for revenue and satisfaction, not accuracy on a sterile question.

What to delegate. Build acceptance testing on noisy scenarios, not clean ones: multi-turn dialogs with digressions, contradictions and returns to closed questions. The source should be a corpus of real conversations, because genuine noise can’t be reliably invented; people break a dialog in ways a developer doesn’t imagine. A judge should evaluate the outcome (did the dialog reach the goal?), not whether individual replies were correct.
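An acceptance run judged on outcome could be wired roughly as follows. Everything here is a toy under stated assumptions: run_acceptance, the scenario format, the stub agent and the substring judge are hypothetical. In practice the scenarios would come from real conversations and the judge would be a human or a model.

```python
from typing import Callable


def run_acceptance(scenarios: list[dict],
                   agent: Callable[[list[str]], str],
                   judge: Callable[[str, str], bool]) -> dict:
    """Run each scenario and judge only the outcome against the goal."""
    failed = []
    for scenario in scenarios:
        outcome = agent(scenario["turns"])
        if not judge(outcome, scenario["goal"]):
            failed.append(scenario["id"])
    return {
        "reached": len(scenarios) - len(failed),
        "total": len(scenarios),
        "failed": failed,
    }


def last_message_agent(turns: list[str]) -> str:
    # Toy agent that reacts only to the last message.
    return turns[-1]


def substring_judge(outcome: str, goal: str) -> bool:
    # Toy judge: the goal keyword must survive into the outcome.
    return goal in outcome


scenarios = [
    {"id": "clean", "goal": "ticket", "turns": ["I need a ticket"]},
    {"id": "noisy", "goal": "ticket",
     "turns": ["I need a ticket", "by the way, what about luggage?"]},
]
report = run_acceptance(scenarios, last_message_agent, substring_judge)
print(report)  # {'reached': 1, 'total': 2, 'failed': ['noisy']}
```

The list of failed scenario ids matters as much as the rate: it shows which kinds of noise make the agent lose the goal.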

What not to do. Don’t take “the bot answers questions well” as readiness for production — that checks the least distinguishing quality. Don’t fix fragility by growing the context window: memory volume doesn’t equal holding the goal. And don’t try to script each noise type as a separate rule — the combinatorics of interference will outrun any set of branches.


Open questions

How to measure noise resilience as a single clear number: this is harder than the share of correct answers, and for now it is addressed by a set of scenarios with rising noise rather than a single metric. Where the line runs between holding the goal and ignoring the user: too rigid a return to the task turns into deafness to the fact that the person has genuinely changed their mind, and that boundary is calibrated on live traffic. How much noise the agent must withstand in a specific process: for a simple informational inquiry the bar is lower, for a long sale or account management it is much higher, and it is set before launch based on the cost of error.
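A set of scenarios with rising noise yields a curve rather than a number. The sketch below assumes each scenario run is tagged with an integer noise level (for example, how many interference events were injected); the function name, the level scale and the sample data are hypothetical.

```python
from collections import defaultdict


def reach_rate_by_noise(results: list[tuple[int, bool]]) -> dict[int, float]:
    """Share of dialogs that reached the goal, per noise level."""
    totals: dict[int, int] = defaultdict(int)
    reached: dict[int, int] = defaultdict(int)
    for level, goal_reached in results:
        totals[level] += 1
        reached[level] += goal_reached  # True counts as 1
    return {level: reached[level] / totals[level] for level in sorted(totals)}


# Made-up runs as (noise level, goal reached) pairs.
results = [
    (0, True), (0, True),
    (1, True), (1, False),
    (2, False), (2, False),
]
print(reach_rate_by_noise(results))  # {0: 1.0, 1: 0.5, 2: 0.0}
```

The level at which the rate falls below the bar set for a given process is one candidate summary, but which bar to use remains the open question.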

If your assistant answers questions confidently in a demo but “drifts” in real long dialogs, you’ve been measuring the wrong characteristic. Let’s see which kind of noise makes your agent lose the goal.