engineering notes

Testing an agent against real dialogs

Why synthetic scenarios are useless for chatbot and AI-agent testing, and how a corpus of real dialogs becomes the source of truth and a company asset.

Executive summary. Most AI agents are tested on scenarios the developers themselves invented — which means they test the team’s imagination, not the agent. A real customer holds a conversation in a completely different way, so such an agent passes its tests and fails in production. The source of truth isn’t invented scenarios but your actual chats, calls, and support tickets. A corpus of live dialogs is worth collecting and keeping as an asset: it’s exactly what shows where the agent loses the customer, and exactly what prevents repeating a failure you’ve already lived through.

Chatbot testing almost always starts the same way: the team sits down and writes scenarios. “The user asked the price — the bot answered.” “The user clarified the dates — the bot offered options.” The scenarios are tidy, logical, they pass green. And then the agent meets a live person, and it turns out no real dialog looks anything like those scenarios.

A synthetic scenario tests not the agent but the developer’s idea of how a customer behaves.

Hypothesis: the source of truth is real dialogs, not invented scenarios

A scenario written by a developer carries one hidden assumption: that the customer reasons the same way the scenario’s author does. But the author knows how the agent works, knows the “right” phrasing of the question, and unconsciously writes exactly that — clear, contradiction-free, down a single branch. A real person almost never does this. They ask half a question, drift off topic, come back, confuse their own data, frame things through emotion rather than fact. So the only honest source of tests is dialogs that have already happened: chat transcripts, call recordings, support tickets. They aren’t invented — so they carry no built-in assumption that the agent will have it easy.

From this follows a simple but inconvenient rule. If you want to know how the agent will behave with customers, test it on how customers have already talked to you. Everything else is a test of the team’s hypothesis about the customer, not a test of the agent.

The trouble with synthetic scenarios isn’t that they’re bad — it’s that they systematically miss reality, and miss it predictably. A developer writes what they can imagine; they can only imagine situations familiar to them. As a result, an entire class of real but “non-obvious” dialogs never makes it into the tests at all: the customer who arrived with someone else’s problem; the customer who changed their request three times; the customer who seemed to agree, then changed their mind. These dialogs aren’t covered — not because they were deemed unimportant, but because no one thought them up.

Then a dangerous mechanism kicks in. The report is green, there are many tests, coverage looks high — and management reads this as “the agent is ready.” But coverage here is deceptive: the invented branches are covered, not the real ones. The agent ships to production with whole layers of untested behavior, and those layers are found by the customer, not the team — at the moment when it’s already too late. This is the same demo-to-production gap we write about in GREEN bias: a beautiful quality report on top of poor real quality.

Why the usual approaches don’t work

The first reflex — “let’s write more scenarios” — doesn’t solve the problem, it makes it worse. More scenarios from the same team means more of the same assumptions: the branches multiply, but they’re all still from the developer’s head, and the blind spots stay blind. Volume grows; reality doesn’t.

The second approach, “let’s have the model itself generate test dialogs,” looks like a way out but reproduces the same error at a new level. An LLM generates plausible, smooth, cooperative dialogs, because that’s what it was trained to do. The generated “customer” helps the agent, answers to the point, doesn’t stumble. You get synthetic data disguised as reality, and it tests the agent even more gently than hand-written scenarios do. Why LLMs play the customer role badly is a big topic in its own right.

The third approach, “run the scenario once and lock in the result,” ignores the nature of LLMs. The same input can produce different answers from one run to the next, so a single green run proves nothing: the next run of the same scenario may turn red. Without statistics across many runs, a “passed” checkmark isn’t a measurement, it’s luck.

All three approaches share the same root: the test stays detached from how customers actually talk.
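The point about single runs can be made concrete with a small calculation. The sketch below is an illustration, not part of the method described here: it assumes each run is scored as a plain pass or fail, and it uses the Wilson score interval (our choice of statistic; the function name wilson_interval is hypothetical) to show how wide the plausible range of the true pass rate stays after only a few runs.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate.

    z = 1.96 corresponds to roughly a 95% confidence level (an assumed default).
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= passes <= runs:
        raise ValueError("passes must be between 0 and runs")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    for passes, runs in [(1, 1), (18, 20)]:
        low, high = wilson_interval(passes, runs)
        print(f"{passes}/{runs} green: true pass rate plausibly {low:.2f} to {high:.2f}")
```

Under these assumptions, one green run out of one is compatible with a true pass rate as low as roughly 0.21, while 18 green runs out of 20 narrow the range to roughly 0.70 to 0.97. The exact statistic matters less than the habit of reporting a rate with its uncertainty instead of a checkmark.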

Engineering model: a corpus of live dialogs as a test asset

Honest chatbot testing is built not around scenarios but around a corpus of real dialogs. In practice this is several steps, and each one is engineering work, not a one-off export.

First — collect the corpus. Take real dialogs from every channel: website chat, messengers, call transcripts, support tickets. The broader the channel coverage, the fewer the blind spots. Data discipline matters here: personal data is anonymized, consent for use is respected, sensitive fields are stripped — the corpus must be legally clean from day one, or it can’t be used at all.
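As a rough illustration of the stripping step, the sketch below masks e-mail addresses and phone-like digit runs in a transcript line. It rests on loud assumptions: the patterns, the placeholder tokens, and the function name redact are ours, and regular expressions alone will miss names, postal addresses, and free-form identifiers. It shows the shape of the step and does not by itself make a corpus legally clean.

```python
import re

# Assumed, deliberately simple patterns; real pipelines need far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Greedy on purpose: it may also mask dates or order numbers that look like phones.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace e-mail addresses and phone-like sequences with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    # Made-up sample line, not taken from any real dialog.
    print(redact("Write to anna@example.com or call +1 (555) 010-9999 today"))
```

Over-masking is the safer failure mode here: a lost order number costs a little context, while a leaked phone number can make the whole corpus unusable.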

Second — annotate it. A raw dialog isn’t yet a test: you need to record how the conversation was supposed to end (the customer placed an order, got an exact answer, was correctly handed off to a human) and where the real dialog had its difficulty — an objection, a topic switch, a contradiction in the data. Annotation turns an archive of transcripts into a set of checkable outcomes.
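One possible shape for an annotated record is sketched below. The field names, the outcome labels, and the JSON Lines storage format are assumptions made for illustration; the only requirement taken from the text is that the intended outcome and the point of difficulty are recorded next to the dialog.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AnnotatedDialog:
    dialog_id: str
    channel: str                 # e.g. "web_chat", "call", "ticket" (assumed labels)
    customer_turns: list[str]    # anonymized customer lines, in order
    expected_outcome: str        # e.g. "order_placed", "handoff_to_human" (assumed labels)
    difficulties: list[str] = field(default_factory=list)  # e.g. "topic_switch"


def to_jsonl(dialogs: list[AnnotatedDialog]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(d), ensure_ascii=False) for d in dialogs)


def from_jsonl(text: str) -> list[AnnotatedDialog]:
    """Parse JSON Lines back into records, skipping blank lines."""
    # Split on "\n" only: json.dumps escapes newlines inside strings, while
    # str.splitlines() would also split on separators such as U+2028.
    return [AnnotatedDialog(**json.loads(line)) for line in text.split("\n") if line.strip()]
```

Keeping the outcome as a label rather than a reference answer is what later allows the run to check whether the customer reached the goal instead of comparing wording.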

Third — turn the corpus into a run. The customer’s real lines are fed to the agent, and what’s evaluated is not a match against a reference text but whether the recorded outcome was reached: did the customer get to the goal or not. This is essential — we test the ability to drive to a result, not word-for-word answer-guessing. And each dialog is run many times, because the model’s non-determinism can’t be caught any other way.
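The run described above can be sketched as a small replay harness. Everything named here is hypothetical: agent stands for whatever callable wraps your agent, reached for the outcome check (a rule, a human label, or a separate judge), and 20 runs is an arbitrary default. The sketch replays the recorded customer lines in their original order, which is a simplification, since a real customer would react to what the agent actually said.

```python
from typing import Callable

Turn = tuple[str, str]                        # (role, text)
Agent = Callable[[list[Turn]], str]           # sees the transcript so far, returns a reply
OutcomeCheck = Callable[[list[Turn]], bool]   # True if the recorded outcome was reached


def replay(agent: Agent, customer_turns: list[str]) -> list[Turn]:
    """Feed the recorded customer lines to the agent one by one."""
    transcript: list[Turn] = []
    for line in customer_turns:
        transcript.append(("customer", line))
        # Pass a copy so the agent cannot mutate the harness's transcript.
        transcript.append(("agent", agent(list(transcript))))
    return transcript


def pass_rate(agent: Agent, customer_turns: list[str],
              reached: OutcomeCheck, runs: int = 20) -> float:
    """Share of runs in which the annotated outcome was reached."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passes = sum(1 for _ in range(runs) if reached(replay(agent, customer_turns)))
    return passes / runs
```

The harness never compares the agent's text against a reference answer. The only question it asks of each run is whether the outcome check holds, and the only number it reports is how often that happened.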

Fourth — close the loop. Every real failure found on the corpus becomes a permanent test. The agent broke on a specific dialog — that dialog stays in the set forever, and any future version must pass it. This way the corpus doesn’t just test, it accumulates memory of failures lived through — the basis of an approach where maturity is measured by the quality of red tests survived, not the count of green ones. To keep the corpus alive, it needs regular replenishment with fresh production dialogs — otherwise it ages along with how customers, products, and channels change. That’s why collecting and storing dialogs is worth building into the operational AI environment as a standing process, not a one-off task.
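A minimal sketch of this loop, assuming each dialog already has a measured pass rate from repeated runs. The 0.9 threshold, the function names, and the use of dialog IDs as keys are illustrative assumptions, not values from this article.

```python
def update_regression_set(regression_ids: set[str],
                          pass_rates: dict[str, float],
                          threshold: float = 0.9) -> set[str]:
    """Return the regression set extended with every dialog below the threshold."""
    newly_failed = {d for d, rate in pass_rates.items() if rate < threshold}
    # Dialogs are only ever added: a failure lived through stays in the set.
    return regression_ids | newly_failed


def release_blockers(regression_ids: set[str],
                     pass_rates: dict[str, float],
                     threshold: float = 0.9) -> list[str]:
    """Regression dialogs the candidate version still fails.

    A dialog with no measured pass rate counts as a failure.
    """
    return sorted(d for d in regression_ids if pass_rates.get(d, 0.0) < threshold)
```

Treating a missing result as a failure is a deliberate choice in this sketch: a regression dialog that was silently skipped should block a release just as a red one does.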

A corpus of real dialogs isn’t an archive of transcripts — it’s an asset: it shows where the agent loses the customer and prevents repeating a failure already lived through.

This corpus has one more property that’s easy to miss: it’s valuable in itself. It contains real customer objections, real phrasings of need, real points where the conversation falls apart. That’s material not just for testing the agent, but for the product, marketing, and training people. A company that collects dialogs systematically gains an asset a competitor can’t reproduce from scratch.

Practical takeaway for the business

Demand tests on real dialogs, not invented ones. The key acceptance question for an AI agent isn’t “how many scenarios does it pass” but “what corpus of live dialogs was it tested on, and how many real failures were found there.” If the agent was tested only on scenarios from the team’s head, you have no data on how it will behave with customers — only data on how the team imagines customers.

What to delegate. Collecting and storing the corpus of real dialogs is a separate task with an owner, not a by-product of development. To start, export the transcripts of one or two channels, anonymize them, and annotate outcomes; even a small real corpus catches more than hundreds of synthetic scenarios.

Where this already worked. In a project for a large travel retailer under NDA, the first version of the dialog assistant was tested on tidy scenarios and passed them almost perfectly. When the tests were rebuilt on a corpus of real inquiries, whole classes of dialogs surfaced in which the agent lost the customer: no one had invented them, because they looked “illogical.” What got fixed was no longer the test score but the conversation itself.

What not to do. Don’t accept an agent on synthetic tests if it will work with live people. Don’t substitute model-generated dialogs for a real corpus — that’s the same synthetics, just invisible. And don’t treat collecting dialogs as a one-off: a corpus that isn’t replenished ages and reopens the blind spots.

Open questions

How many dialogs are enough for an honest corpus depends on the diversity of your customers and channels, not on a round number; the benchmark is covering the main types of inquiries, not amassing volume for its own sake. How to anonymize data without losing the substance of the dialog is a separate engineering and legal task, best done before collection rather than after. And who owns the corpus as an asset — product, support, or a dedicated data function — is a management question, better settled before the dialogs start piling up with no owner. More broadly, this touches the limits of what AI actually knows about your data and the structural problems of complex agentic systems.

If your AI agent was tested on scenarios the team itself wrote, you don’t yet know how it will behave with real customers. Let’s see how to build a corpus of your dialogs and what it reveals about the agent. Related work is in our cases and on the AI systems development page.