engineering notes
Building multi-agent systems: architecture that doesn't fall apart
How to design multi-agent systems that work in production: roles, contracts, coordination, fault tolerance and predictable cost.
In brief for executives. A multi-agent system pays off not because “the company now has AI”, but because it removes repetitive coordination from people: task routing, data reconciliation, status tracking. A prototype made of chained prompts is assembled in a week and looks convincing in a demo. It falls apart in production not because of the model, but because of the absence of architecture: contracts, state, failure handling and handoff points to humans. The decision to scale should be made not from the demo but from whether these four things are designed in — they determine both cost of ownership and the price of an error.
The demo looks like this: several LLM-based agents pass a task to one another — one searches for data, another phrases the answer, a third checks it. On the test example it all works, the client is happy, the decision is “let’s launch it on the real flow”. A month later, in production, the system loops on hard cases, returns different answers to the same request, and the model bill grows faster than the volume of processed tasks. Then come attempts to “tweak the prompts” and an argument over whether to revert to the manual process.
This is a typical story, and the problem in it is not a model problem but an engineering one. A multi-agent system that doesn’t fall apart in production differs from the demo not by prompt quality but by having an architecture.
The system falls apart not because of the model but because of the absence of architecture.
Hypothesis: a multi-agent system is a distributed process, not “many prompts”
The word “agent” colloquially means almost anything: a model call with an instruction, or an autonomous entity making decisions. Because of this blur, a multi-agent system is often built as a set of prompts calling each other in a loop.
A working definition is different. An agent is a process participant with one area of responsibility, a limited toolset, a defined input/output format and predictable behaviour on failure. A multi-agent system is a distributed process over such participants, with explicit coordination and typed message exchange. The key word here is not “agent” but “process”: robustness is determined by how participants are connected, not by how good each individual prompt is.
Adoption is near-universal, but measurable business impact is rare. The gap is not access to AI; it is whether AI has been turned into a managed process.
Problem: the demo lives on the “happy path”, production on exceptions
A prototype walks one route: a clear request, data in place, the model answers on the first try. On a real flow, the share of such requests rarely exceeds half. The rest are ambiguous phrasings, missing data, contradictory sources, and external services that respond late or with an error.
Under these conditions, a system of chained prompts shows three persistent failures. First — looping: an executor agent cannot solve a task, hands it to the coordinator, which returns it reworded, and the cycle repeats until a timeout; every iteration is paid-for tokens with no result. Second — divergence: the same input yields different answers because state is fixed nowhere; for the business this means the system cannot be relied on in a process with accountability. Third — silent cost growth: to “not lose quality”, the whole inter-agent history is stuffed into each agent’s context, request length grows step by step, and the bill grows faster than the number of tasks.
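The third failure can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name, the step count and the token figures are hypothetical, not measurements. It compares the total input tokens paid for when every step receives the whole accumulated history with the case where each step receives only a fixed-size input.

```python
def total_input_tokens(steps: int, base_tokens: int,
                       tokens_added_per_step: int,
                       carry_full_history: bool) -> int:
    """Sum the input tokens paid for across all steps of one task."""
    total = 0
    history = 0  # tokens of inter-agent history accumulated so far
    for _ in range(steps):
        # Each call pays for its own input plus, optionally, all prior history.
        total += base_tokens + (history if carry_full_history else 0)
        history += tokens_added_per_step
    return total


# Hypothetical figures: 10 steps, 500 tokens of own input per step,
# 400 tokens of history added by each step.
full_history = total_input_tokens(10, 500, 400, carry_full_history=True)
fixed_input = total_input_tokens(10, 500, 400, carry_full_history=False)
print(full_history, fixed_input)
```

With full history the total grows roughly with the square of the number of steps, while with fixed-size inputs it grows linearly. That is why the bill can outpace task volume even when the number of tasks is stable.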
Adoption grows by an order of magnitude, yet nearly half of projects never reach an outcome. The gap between a demo and a working system is closed by architecture, not by the model.
Why the usual approaches don’t work
The natural reaction is “add another agent” or “improve the coordinator’s prompt”. It doesn’t help, because it treats the symptom, not the cause. Adding an agent increases the number of links; if links are implicit (agents “agree” in free text), each new link is a new source of ambiguity, and the system becomes not smarter but more tangled. Improving the coordinator’s prompt runs into the fact that a prompt is not a protocol: free text between agents cannot be validated, versioned and debugged the way a typed message can. The root cause is that the system lacks three things without which a distributed process does not work, with or without AI: contracts at the seams, fixed state, and explicit failure handling. Prompts do not compensate for that.
Engineering model: four pillars of the architecture
A multi-agent system that holds up in production stands on four pillars. None is about model choice.
Contracts at each agent’s input and output. An agent has a typed format for what it accepts and what it returns: not “free text” but a structure with mandatory fields and an explicit “couldn’t / unsure” flag. A contract is a boundary of responsibility: it shows exactly which participant failed, and that participant can be replaced or reworked without touching the rest. Contracts are versioned: a format change is a managed event, not a silent break of a neighbouring agent.
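A contract of this kind could look roughly like the following. This is a minimal sketch using only the standard library; the field names, the `Status` values and the version string are assumptions made for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping

CONTRACT_VERSION = "1.0"  # hypothetical version label
REQUIRED_FIELDS = {"task_id", "agent", "status", "payload", "confidence"}


class Status(str, Enum):
    OK = "ok"
    UNSURE = "unsure"   # the explicit "couldn't / unsure" flag
    FAILED = "failed"


class ContractError(ValueError):
    """Raised when a message does not satisfy the contract."""


@dataclass(frozen=True)
class AgentResult:
    task_id: str
    agent: str
    status: Status
    payload: Mapping[str, Any]
    confidence: float
    contract_version: str = CONTRACT_VERSION


def parse_result(raw: Mapping[str, Any]) -> AgentResult:
    """Validate a raw message at the seam between two agents."""
    missing = REQUIRED_FIELDS - set(raw)
    if missing:
        raise ContractError(f"missing fields: {sorted(missing)}")
    version = raw.get("contract_version", CONTRACT_VERSION)
    if version != CONTRACT_VERSION:
        raise ContractError(f"unsupported contract version: {version!r}")
    try:
        status = Status(raw["status"])
    except ValueError as exc:
        raise ContractError(f"unknown status: {raw['status']!r}") from exc
    confidence = float(raw["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ContractError("confidence must be in [0, 1]")
    return AgentResult(raw["task_id"], raw["agent"], status,
                       raw["payload"], confidence, version)


result = parse_result({"task_id": "t-1", "agent": "search", "status": "unsure",
                       "payload": {"hits": []}, "confidence": 0.3})
print(result.status.value, result.contract_version)
```

Because validation happens at the boundary, a malformed message is attributed to the agent that produced it rather than surfacing as a confusing failure two steps later.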
Fixed process state. Each task has state living outside an individual model call: what’s done, which steps passed, which data arrived. State makes steps idempotent — a repeat doesn’t corrupt the result — and ensures reproducibility: the same input follows the same route rather than being reassembled each time.
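One way to picture idempotent steps is a task state that records each completed step, so that a retry returns the stored result instead of calling the model again. The sketch below keeps state in memory for brevity; `TaskState` and `run_step` are hypothetical names, and a real system would presumably persist the state in a durable store.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class TaskState:
    """State of one task, kept outside any individual model call."""
    task_id: str
    completed: Dict[str, Any] = field(default_factory=dict)


def run_step(state: TaskState, step_name: str, fn: Callable[[], Any]) -> Any:
    """Run a step at most once per task; repeats return the recorded result."""
    if step_name in state.completed:
        return state.completed[step_name]
    result = fn()
    state.completed[step_name] = result
    return result


calls = {"count": 0}


def fetch_data() -> str:
    # Stands in for an expensive model or service call.
    calls["count"] += 1
    return "customer record"


state = TaskState(task_id="t-1")
first = run_step(state, "fetch", fetch_data)
second = run_step(state, "fetch", fetch_data)  # retry after a crash or timeout
print(first == second, calls["count"])
```

The retry neither repeats the paid call nor changes the recorded result, which is the property the paragraph above describes.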
Explicit coordination instead of negotiating in text. It must be clear who decides the next step: either a coordinator that routes by the previous step’s result, or an event-driven scheme where a step emits an event and the next participant is subscribed to it. In both cases the route is set by architecture, not inferred by the model from free correspondence. Coordination has limits: a cap on iterations, timeouts, early exit on low confidence. A cycle that doesn’t converge stops and escalates to a human rather than burning tokens until timeout.
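The coordinator variant can be sketched as a routing table plus hard limits. Everything named here (`coordinate`, the step and label names, the default thresholds) is an assumption for illustration, and wall-clock timeouts are left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class StepResult:
    label: str         # outcome label used for routing
    confidence: float  # 0.0 .. 1.0


@dataclass(frozen=True)
class Outcome:
    status: str        # "done" or "escalated"
    reason: str
    steps_run: int


Handler = Callable[[], StepResult]
# (current step, outcome label) -> next step; None means the task is finished.
Routes = Dict[Tuple[str, str], Optional[str]]


def coordinate(handlers: Dict[str, Handler], routes: Routes, start: str,
               max_iterations: int = 6, min_confidence: float = 0.6) -> Outcome:
    current = start
    for steps_run in range(1, max_iterations + 1):
        result = handlers[current]()
        if result.confidence < min_confidence:
            return Outcome("escalated", f"low confidence at '{current}'", steps_run)
        key = (current, result.label)
        if key not in routes:
            return Outcome("escalated", f"no route for {key}", steps_run)
        next_step = routes[key]
        if next_step is None:
            return Outcome("done", "completed", steps_run)
        current = next_step
    # The cycle did not converge: stop and hand over instead of burning tokens.
    return Outcome("escalated", "iteration cap reached", max_iterations)


routes: Routes = {
    ("execute", "solved"): None,
    ("execute", "stuck"): "rephrase",
    ("rephrase", "ok"): "execute",
}
# An executor that never solves the task reproduces the looping failure.
handlers = {
    "execute": lambda: StepResult("stuck", 0.9),
    "rephrase": lambda: StepResult("ok", 0.9),
}
looping = coordinate(handlers, routes, start="execute")
print(looping)
```

The route lives in the `routes` table rather than in model output, so a non-converging executor and rephraser pair ends in an escalation after a bounded number of paid calls.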
Handoff to a human as part of the process, not an emergency exit. Where the price of an error is high or confidence is low, the process routinely hands the decision to a human with the context already gathered. This is not “the system failed” — it is a designed step. The boundary at which an agent must stop is defined by business rules, not by the model.
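As a rough illustration, the stopping boundary can be expressed as a policy object owned by the business rather than as text in a prompt. The thresholds, the `amount` field (used here as a stand-in for the price of an error) and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class HandoffPolicy:
    """Business rules that define where an agent must stop."""
    max_auto_amount: float
    min_confidence: float


@dataclass(frozen=True)
class ProposedDecision:
    task_id: str
    amount: float       # stand-in for the price of an error
    confidence: float
    context: Dict[str, Any] = field(default_factory=dict)


def needs_human(decision: ProposedDecision, policy: HandoffPolicy) -> bool:
    return (decision.amount > policy.max_auto_amount
            or decision.confidence < policy.min_confidence)


def build_handoff(decision: ProposedDecision) -> Dict[str, Any]:
    """Package the context already gathered so the human does not start over."""
    return {
        "task_id": decision.task_id,
        "proposed_amount": decision.amount,
        "confidence": decision.confidence,
        "context": dict(decision.context),
    }


policy = HandoffPolicy(max_auto_amount=1000.0, min_confidence=0.8)
small = ProposedDecision("t-1", amount=200.0, confidence=0.95)
large = ProposedDecision("t-2", amount=5000.0, confidence=0.95,
                         context={"sources": ["crm", "billing"]})
print(needs_human(small, policy), needs_human(large, policy))
ticket = build_handoff(large)
```

Changing the boundary then means editing the policy values, not rewording a prompt and hoping the model respects it.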
On top of these four pillars sits observability: every step is tagged, and for each one result, cost, latency and error rate are collected. Without it you can neither debug divergence nor find which step eats the budget. Model choice in this architecture is a swappable primitive: the model is picked per step by required quality, risk and latency, not set as a global constant. Swapping the model on a step must not break the system — that is exactly what the contracts ensure.
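A minimal sketch of per-step observability, assuming each step emits one record with its cost, latency and success flag. The record fields and the sample figures are invented for illustration; a production system would more likely rely on an existing tracing stack than on a hand-rolled aggregator.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class StepRecord:
    task_id: str
    step: str          # the tag that identifies the step
    cost_usd: float
    latency_ms: float
    ok: bool


def summarize(records: Iterable[StepRecord]) -> Dict[str, Dict[str, float]]:
    """Aggregate call count, cost, latency and error rate per step."""
    by_step: Dict[str, List[StepRecord]] = defaultdict(list)
    for record in records:
        by_step[record.step].append(record)
    summary: Dict[str, Dict[str, float]] = {}
    for step, rows in by_step.items():
        n = len(rows)
        summary[step] = {
            "calls": n,
            "total_cost_usd": sum(r.cost_usd for r in rows),
            "avg_latency_ms": sum(r.latency_ms for r in rows) / n,
            "error_rate": sum(1 for r in rows if not r.ok) / n,
        }
    return summary


records = [
    StepRecord("t1", "search", 0.25, 100.0, True),
    StepRecord("t1", "draft", 0.5, 300.0, True),
    StepRecord("t2", "search", 0.25, 200.0, False),
    StepRecord("t2", "draft", 1.0, 500.0, True),
]
summary = summarize(records)
most_expensive = max(summary, key=lambda s: summary[s]["total_cost_usd"])
print(most_expensive, summary[most_expensive])
```

Even this small aggregation answers the two questions from the paragraph above: which step fails most often and which step eats the budget.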
Practical takeaway for business
Several applied things follow for a leader.
First, the demo says nothing about robustness. A convincing prototype shows the task is solvable in principle but not the cost of ownership or behaviour on exceptions. The decision to scale should be made on a different question: whether contracts, state, failure handling and handoff points are designed in. This can be checked on a pilot with direct questions, without being an engineer.
Second, cost predictability is a consequence of architecture, not of negotiating with the model provider. If spend grows faster than the volume of tasks, somewhere the context length between agents is silently growing; this is found via tracing in hours if observability is built in, and not found at all if it isn’t.
Third, the value of a multi-agent system for the company is not “AI was adopted”, but removing coordination load: task routing, reconciling data across systems, status tracking. That is the basis for an ROI calculation — by a specific process and the routine removed from people, not “by the system on average”.
And fourth, a clear architecture keeps the cost of change manageable for years ahead. A system with explicit contracts is reworked piece by piece; a system of intertwined prompts is rewritten wholesale on every significant change.
Open questions
Being honest about the limits is also part of an engineering stance.
Where the reasonable limit of autonomy lies has no universal answer: the more decisions a system makes without a human, the higher the effect and the more expensive the error; the balance point depends on the process and is set deliberately, not “by default”.
Who is accountable for a decision made by an agent is not only a technical question. Architecture can provide traceability of every step, but distributing accountability is a management decision to be made before launch, not after the first contentious case.
How to measure the reliability of a distributed process before deployment is a problem without a mature standard. We rely on reproducibility on historical data and the share of cases driven to a result without escalation, but these are practice-derived references, not an industry metric.
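The two references mentioned above can be computed from a replay of historical inputs. The sketch below assumes each input is replayed more than once and that outputs can be compared for equality; the function name, the run format and the sample data are hypothetical, and the result is a practice-derived indicator rather than a standard metric.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One replayed run: (input id, final output, whether it escalated to a human)
Run = Tuple[str, str, bool]


def reliability_metrics(runs: List[Run]) -> Dict[str, float]:
    outputs_by_input: Dict[str, List[str]] = defaultdict(list)
    for input_id, output, _escalated in runs:
        outputs_by_input[input_id].append(output)
    # An input counts as reproducible if every replay gave the same output.
    reproducible = sum(1 for outs in outputs_by_input.values()
                       if len(set(outs)) == 1)
    without_escalation = sum(1 for _, _, escalated in runs if not escalated)
    return {
        "reproducibility": reproducible / len(outputs_by_input),
        "no_escalation_share": without_escalation / len(runs),
    }


runs: List[Run] = [
    ("a", "approve", False),
    ("a", "approve", False),
    ("b", "approve", False),
    ("b", "reject", True),
]
metrics = reliability_metrics(runs)
print(metrics)
```

Exact string comparison is a deliberately strict choice; for free-text outputs a looser equivalence check would probably be needed, which is part of why no mature standard exists yet.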
If you have a process where people mostly coordinate (routing tasks, reconciling data, tracking statuses), that is a candidate for a multi-agent system. We’ll work out what gets automated first and how to measure the effect.