engineering notes

Agent versus agent: a new model of QA

Why the tester of a conversational AI is another agent, not a human. The “Customer Simulator → Target Agent → Judge” architecture as multi-agent engineering applied to QA.

Executive summary. A conversational AI agent cannot be tested adequately by hand: real variants of customer behavior number in the thousands, while a human will check dozens. So the tester becomes another agent. The working model has three roles: a customer simulator that tries to derail the dialog, the target agent under test, and a judge that evaluates the outcome. This isn’t exotic; it’s a direct application of multi-agent engineering to quality control. For the business it delivers a scale of testing that manual QA can’t reach in principle, and it moves the cost of defects to before launch.

With ordinary code, a human writes the tests: inputs are finite, behavior is deterministic, hundreds of cases cover almost everything. With a conversational agent that doesn’t work. The space of what a live customer might say is practically infinite, and each run with an LLM goes slightly differently. A human physically can’t check enough. So a shift happens in conversational QA: the tester becomes not a human but another agent.

If user behavior is infinite and a human checks dozens of cases, the tester has to be made an agent.

Hypothesis: the tester of a conversational agent is another agent

The thesis is direct: the reliability of a conversational AI agent can’t be established by manual testing, because the scale doesn’t add up. To catch a rare but expensive failure — the one where the agent loses the goal after twenty minutes of chaos or confidently tells the customer something wrong — you need thousands of varied dialogs under pressure. A human will run dozens. The gap isn’t in human quality but in order of magnitude.
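The coverage arithmetic can be made concrete. The sketch below assumes dialogs are independent and that a given failure appears with a fixed probability per dialog; the rate of 1 in 1000 is a hypothetical figure chosen for illustration, not a measurement.

```python
import math


def detection_probability(failure_rate: float, dialogs: int) -> float:
    """Chance of seeing the failure at least once in `dialogs` independent runs."""
    return 1.0 - (1.0 - failure_rate) ** dialogs


def dialogs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest number of independent dialogs that surfaces the failure
    at least once with the requested confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


# Hypothetical failure that shows up in 1 of every 1000 dialogs.
RATE = 0.001

print(f"50 dialogs:   {detection_probability(RATE, 50):.1%}")
print(f"3000 dialogs: {detection_probability(RATE, 3000):.1%}")
print(f"dialogs for 95% confidence: {dialogs_needed(RATE, 0.95)}")
```

Under these assumptions a manual run of 50 dialogs has roughly a 1-in-20 chance of ever seeing the failure, while about three thousand dialogs are needed to see it with 95% confidence.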

Only an automated opponent can close that gap — an agent that itself runs a dialog against your agent, in thousands of variants, with different behavior models, reproducibly and around the clock. Conversational testing stops being a manual discipline and becomes a multi-agent task: one agent checks another, a third judges. This isn’t futurism — it’s the engineering answer to the simple arithmetic of coverage.

Problem: manual QA doesn’t scale, and a plain autotest misses the point

A team that has realized this usually has two failed approaches behind it.

The first — manual testing. A QA engineer sits down and runs dialogs by hand. The approach is honest, but it hits two walls. The wall of volume: dozens of checked dialogs against the thousands needed isn’t coverage but a sample that rare failures simply don’t fall into. The wall of reproducibility: a human won’t repeat the same dialog twice identically, and because of LLM non-determinism a single run proves nothing — you need statistics across many repeats of the same pressure.
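Since a single run proves nothing, a pass rate over repeated runs is more honest when reported with an interval. A minimal sketch, assuming the Wilson score interval as the statistic (the article does not prescribe one) and a 95% level via `z = 1.96`:

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the true pass rate after `runs` repeats
    of the same scenario, `passes` of which the judge accepted."""
    if runs <= 0:
        raise ValueError("at least one run is required")
    p = passes / runs
    denom = 1.0 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1.0 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# One green run versus two hundred green runs of the same pressure scenario.
print(wilson_interval(1, 1))
print(wilson_interval(200, 200))
```

The lower bound is the useful number: after one passing run it stays far below anything that could be called reliable, and it only approaches 1 after many repeats.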

The second — a classic autotest. Fix the utterance, fix the expected answer, compare. For deterministic code that works; for dialog it’s almost useless. The agent’s answer is phrased differently every time, and literal string comparison catches noise, not substance. More importantly, such a test checks one utterance, whereas dialog breaks on coherence: at the second topic turn, on a return to a closed question, on a contradiction between the third and tenth utterance. A single-utterance autotest can’t see this at all.

Both approaches are an attempt to check a conversational system with tools meant for static code. Dialog is a process, not a function; it has to be checked with a process too.

Why the usual approaches don’t work

Suppose the team decides to combine them: take an LLM and have it play the customer, to get both scale and liveness. Logical — and here the main trap is hidden.

A naive customer simulator is too reasonable. Within two or three turns it stops getting in the way and starts helping the target agent: answers helpfully, clarifies politely, leads the dialog to a close. That happens because language models are trained to be cooperative and to bring a conversation to a result — that’s their base behavior. The simulator then reproduces the very mistake it was created to eliminate: it checks how the agent works when it’s being helped. And agents break on those who don’t help.

The second problem is evaluation. Even if the simulator pushes properly, who decides whether the dialog passed or not? Comparison against a reference answer won’t do: there are many correct phrasings. Checking “did the agent call the right function” is deceptive: the internal call can be correct while the customer still walked away empty-handed. You need a separate evaluator that looks at the outcome for the user, not at the internals.

And the third — without the right source of scenarios, even a perfect loop checks fiction. If the simulator attacks the way the engineer imagines customer behavior, it won’t find real failures — people break the agent in ways the developer doesn’t expect. This is the same illusion of green that falsifies quality, only carried over to the level of the test agent.

Engineering model: Customer Simulator → Target Agent → Judge

The working architecture is a pipeline of three agents, and each role solves one of the problems above.

Customer simulator. This is the attacking agent, and its system goal is the inverse of the target’s: not to help, but to obstruct reaching the result as much as possible. To keep it from sliding into cooperativeness, it is given not a generic “be a customer” instruction but a specific hard-behavior model from a library: anxious, conflict-prone, forgetful, suspicious, impulsive, bored. Each model is a distinct pressure profile: the forgetful one returns to closed questions, the conflict-prone one argues and changes premises, the suspicious one disbelieves and re-asks. This library of “hard people” is an engineering artifact that is maintained and grown, not a one-off setting.
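One way to keep the persona library as a maintained artifact is to store each profile as data and build the simulator’s system prompt from it. This is a sketch only: the `Persona` type, the `PERSONA_LIBRARY` name, and the prompt wording are hypothetical, and only the three profiles whose tactics the text describes are filled in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    name: str
    pressure_tactics: tuple[str, ...]


# Hypothetical library; the tactics are taken from the profiles described above.
PERSONA_LIBRARY = {
    "forgetful": Persona(
        "forgetful", ("return to questions that were already closed",)
    ),
    "conflict-prone": Persona(
        "conflict-prone", ("argue with the agent's answers", "change premises mid-dialog")
    ),
    "suspicious": Persona(
        "suspicious", ("disbelieve the agent's answers", "re-ask the same question")
    ),
}


def simulator_system_prompt(persona: Persona, scenario: str) -> str:
    """Build a system prompt whose goal is the inverse of the target agent's."""
    tactics = "\n".join(f"- {t}" for t in persona.pressure_tactics)
    return (
        f"You play a {persona.name} customer in this scenario: {scenario}\n"
        "Your goal is the inverse of the agent's: do not help it reach the result.\n"
        "Stay in character for the whole dialog and apply these tactics:\n"
        f"{tactics}"
    )


print(simulator_system_prompt(PERSONA_LIBRARY["forgetful"], "refund request"))
```

Keeping the tactics explicit in the prompt is meant to counter the drift toward cooperativeness; whether this is sufficient for a given model has to be checked on real runs.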

Target agent. The very one that will go to customers, in full production configuration: the same tools, the same constraints, the same stack. No lightweight versions — otherwise the loop lies.

Judge. A separate evaluator agent that looks at the outcome: did the dialog reach a real result for the user, did the target agent emit anything dangerous or false, did it hold the goal under pressure. Crucially, the judge evaluates the fact of the outcome, not the process. To keep the judge from playing along, it is given clear outcome criteria and kept independent of the simulator.
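The three outcome questions can be fixed as a structured verdict so that the judge cannot answer with a vague impression. A minimal sketch with hypothetical field names; the rule that all three criteria must hold for a pass is an assumption, not something the article states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    outcome_reached: bool            # did the user get a real result?
    unsafe_or_false_statement: bool  # did the agent say anything dangerous or false?
    goal_held_under_pressure: bool   # did the agent keep the goal despite obstruction?
    rationale: str = ""

    @property
    def passed(self) -> bool:
        # Assumed rule: every outcome criterion must hold.
        return (
            self.outcome_reached
            and not self.unsafe_or_false_statement
            and self.goal_held_under_pressure
        )


# Correct on the surface, but the agent stated something false along the way.
verdict = Verdict(
    outcome_reached=True,
    unsafe_or_false_statement=True,
    goal_held_under_pressure=True,
    rationale="Refund issued, but the agent quoted a return window that does not exist.",
)
print(verdict.passed)
```

Note that nothing in the verdict refers to which internal functions were called: the fields describe only what the customer ended up with.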

Tying it all together is a corpus of real dialogs: actual chats, calls, tickets. From it the simulator draws realistic scenarios and calibrates pressure, so that it pushes harder than live people usually do but stays within the bounds of their real behavior and is never more absurd than reality.
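The control flow of the three-role pipeline can be sketched as a simple loop. Everything here is illustrative: the three roles are plain callables, and the stubs at the bottom stand in for real model calls so the example runs on its own.

```python
from typing import Callable, Optional

Turn = tuple[str, str]      # (speaker, text)
Transcript = list[Turn]


def run_dialog(
    simulator: Callable[[Transcript], Optional[str]],
    target: Callable[[Transcript], str],
    judge: Callable[[Transcript], bool],
    max_turns: int = 20,
) -> tuple[Transcript, bool]:
    """Alternate simulator and target turns, then hand the transcript to the judge."""
    transcript: Transcript = []
    for _ in range(max_turns):
        customer_msg = simulator(transcript)
        if customer_msg is None:  # the simulator has nothing left to say
            break
        transcript.append(("customer", customer_msg))
        transcript.append(("agent", target(transcript)))
    # The judge receives only the transcript, not the simulator's persona or goal.
    return transcript, judge(transcript)


# Stubs standing in for real model calls (purely illustrative).
SCRIPT = ["I want a refund.", "Wait, what was the delivery date again?", "So, the refund?"]


def scripted_simulator(transcript: Transcript) -> Optional[str]:
    turn_index = len(transcript) // 2
    return SCRIPT[turn_index] if turn_index < len(SCRIPT) else None


def stub_target(transcript: Transcript) -> str:
    return f"Regarding '{transcript[-1][1]}': your refund has been issued."


def stub_judge(transcript: Transcript) -> bool:
    return any(who == "agent" and "refund has been issued" in text for who, text in transcript)


transcript, passed = run_dialog(scripted_simulator, stub_target, stub_judge)
print(len(transcript), passed)
```

In a real loop `target` would wrap the agent in its full production configuration, and `run_dialog` would be repeated across personas and scenarios drawn from the corpus.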

data

Inter-agent protocols are becoming a standard

1000+ MCP servers available by early 2025 (protocol introduced late 2024)

97M+ SDK downloads per month a year after launch

4 of 4 of the largest model providers adopted it (Anthropic, OpenAI, Google, Microsoft)

Machine coordination is already getting its protocols — an infrastructure shift, not a hypothesis. The future engineering of AI systems is built around such standards.

Source: Model Context Protocol, annual review, 2025 https://blog.modelcontextprotocol.io/posts/2025-11-25-first-mcp-anniversary/

These numbers aren’t about QA directly, but they show the key thing: machine coordination is already getting its standards. When an inter-agent protocol gathers thousands of servers, tens of millions of downloads, and support from all the largest model providers within a year, “agent versus agent” stops being an experiment and becomes an infrastructure shift. QA through an agent pipeline is a specific but telling case of that shift: we will check agents with agents, and the shared infrastructure for it is already being built.

Architecturally the “simulator → agent → judge” pipeline is a multi-agent system in its purest form: several agents with different goals, coordination, outcome evaluation. So it leans on the competence of multi-agent system architecture and AI agent orchestration, and inherits the same risks — role desync, context leaking between agents, a judge that quietly plays along.

Practical takeaway for the business

The main conclusion: quality control of a conversational agent isn’t a line in the release checklist but a separate multi-agent system that has to be built and maintained. Its value to the business is a scale of testing unreachable by hand, and moving the cost of defects to before the customer.

What it gives in money. One class of failures found in advance — the agent that confidently tells the customer something wrong — pays for the loop, because the cost of such a failure in production is measured in lost deals and reputation, not a line in a bug tracker.

What to set up. A test pipeline of a simulator with a library of hard roles, a target agent in production configuration, and an independent judge over a corpus of your real dialogs. This is the same multi-agent engineering as in production agentic systems, so it’s built by the same hands — as a separate task for AI agent development, not a touch-up by the authors of the main agent.

What not to do. Don’t confuse a naive simulator with a red-team loop: a simulator with no hard-behavior models checks ideal conditions that don’t exist in production. Don’t assess quality by internal calls; judge by the outcome for the customer. And don’t assume a one-off manual run before release replaces a persistent loop: the agent changes on every iteration, and it has to be checked on each, not once. A green report without an opponent agent isn’t proof that a real person reaches the result.

Open questions

How aggressive to make the simulator is an open calibration question. A simulator that is too weak finds no failures; one that is too absurd finds failures that don’t exist and forces the team to defend against the impossible. Tying it to a corpus of real dialogs keeps the pressure within realistic bounds, but the boundary itself is set by the cost of error in the specific process.

Who judges the judge is a separate engineering task. If the evaluator quietly plays along or measures the wrong outcome, the whole loop yields a false green. Here clear outcome criteria, the judge’s independence from the simulator, and periodic reconciliation of its verdicts with human assessment on a sample all help.
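Reconciliation with human assessment can be reduced to two numbers on a labeled sample: agreement corrected for chance, and the count of false greens. The sketch assumes Cohen’s kappa as the agreement measure (the article does not name one) and boolean pass/fail labels.

```python
def cohen_kappa(judge: list[bool], human: list[bool]) -> float:
    """Chance-corrected agreement between judge and human pass/fail verdicts."""
    if len(judge) != len(human) or not judge:
        raise ValueError("verdict lists must be non-empty and of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:  # both raters gave one identical label to every dialog
        return 1.0
    return (observed - expected) / (1 - expected)


def false_greens(judge: list[bool], human: list[bool]) -> int:
    """Dialogs the judge passed although the human reviewer failed them."""
    return sum(j and not h for j, h in zip(judge, human))


# A judge that passes everything agrees with the human half the time,
# yet carries no information: kappa is zero.
lenient_judge = [True, True, True, True]
human_labels = [True, False, True, False]
print(cohen_kappa(lenient_judge, human_labels), false_greens(lenient_judge, human_labels))
```

Raw agreement alone would hide a judge that quietly plays along, which is why the chance-corrected value and the false-green count are tracked separately.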

And a management question: who decides, and by what threshold, that the agent has survived enough agent attacks to go to customers. That threshold depends on the cost of error and is set before launch, not derived after the fact from the first incident.
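Setting the threshold before launch can be as simple as committing it to configuration ahead of the first run. A hypothetical sketch: the `ReleaseGate` type and all numeric values are placeholders that would be derived from the cost of error in the specific process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseGate:
    min_dialogs: int            # fewer runs than this is a sample, not evidence
    min_pass_rate: float        # share of dialogs the judge must accept
    max_critical_failures: int  # e.g. dangerous or false statements

    def allows_release(self, dialogs: int, passes: int, critical_failures: int) -> bool:
        if dialogs == 0 or dialogs < self.min_dialogs:
            return False
        if critical_failures > self.max_critical_failures:
            return False
        return passes / dialogs >= self.min_pass_rate


# Placeholder thresholds, fixed before any results are seen.
GATE = ReleaseGate(min_dialogs=2000, min_pass_rate=0.98, max_critical_failures=0)

print(GATE.allows_release(dialogs=3000, passes=2970, critical_failures=0))
print(GATE.allows_release(dialogs=3000, passes=2970, critical_failures=1))
```

Because the gate is frozen and versioned alongside the agent, the decision rule cannot drift after the first incident.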

If you check a conversational AI agent by hand or with single-utterance autotests, you see a sample, not reliability. We’ll design an agent QA loop for your agent and your real dialogs.