Carbonfay

engineering notes

Problems of multi-agent systems and how to avoid them

A breakdown of typical multi-agent failures — looping, context drift, cost growth — and engineering ways to avoid them.

In brief for executives. Multi-agent systems fail in production predictably, and almost never because of a “weak model”. A study of more than 1,600 execution traces shows that the overwhelming share of failures comes from fuzzy specification and coordination breakdowns, that is, from architecture. The good news for business: because failures are predictable, they show up as early as the pilot, and the decision to scale can rest on concrete signs rather than on the impression left by a demo.


When a multi-agent system is launched into production, problems do not appear at random. They repeat project after project, and they have names. Let’s go through what exactly breaks, why, and how to catch it in advance.

Failures are predictable, which means they are visible as early as the pilot.

The natural reaction to an AI-system failure is “the model isn’t smart enough, let’s wait for the next one”. In practice, swapping the model barely moves the reliability of a multi-agent system, because what breaks is not an individual agent’s intelligence but how the agents are connected.

Why multi-agent systems fail (1,600+ execution traces)

Unclear specification (roles, tasks, constraints): 42%
Coordination breakdowns (communication, state, goals): 37%
Verification gaps (no validation or quality checks): 21%

Nearly 80% of failures (42% + 37% = 79%) come from specification and coordination, that is, from architecture rather than a “weak model”. They are fixed by contracts and explicit coordination, not by swapping the LLM.

Source: Why Do Multi-Agent LLM Systems Fail? (MAST, UC Berkeley), NeurIPS 2025 https://arxiv.org/pdf/2503.13657

The numbers are telling: specification and coordination account for nearly four-fifths of failures. These are engineering categories, not “AI quality”.

Problem: four typical failures

Looping. An agent cannot solve a task and hands it to the coordinator, which returns it reworded, and the cycle repeats until a timeout. Every iteration costs tokens and produces no result.

Context drift. To “not lose quality”, the whole inter-agent history is stuffed into each agent’s context. Step by step the context grows, answers “drift”, and the bill grows faster than the number of tasks.

Divergence. The same input yields different answers because state is not persisted anywhere: each run reassembles the context from scratch. For the business this means the system cannot be relied on in a process that carries accountability.

Acting on a wrong assumption. An agent does not ask for clarification when the input is ambiguous and proceeds on a wrong premise. This is one of the most expensive failures, because the result looks plausible and is passed further down the process.

Why the usual approaches don’t work

“Add another agent” increases the number of links. If links are implicit (agents agree in free text), each new link is a new source of ambiguity. The system becomes not more reliable but more tangled.

“Improve the coordinator’s prompt” hits the fact that a prompt is not a protocol. Free text between agents cannot be validated, versioned and debugged like a typed message. Until the exchange has a shape, coordination rests on luck.
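As an illustration of what giving the exchange a shape could look like, here is a minimal Python sketch of a typed handoff message with validation. The field names, the status values and the `parse_handoff` helper are all assumptions made for this example, not a prescribed protocol.

```python
import json
from dataclasses import dataclass

SCHEMA_VERSION = 1  # hypothetical version number for this message shape
ALLOWED_STATUS = {"done", "failed", "needs_clarification"}  # assumed values
REQUIRED = ("schema_version", "task_id", "sender", "status", "payload")


@dataclass(frozen=True)
class HandoffMessage:
    """A typed message passed between agents instead of free text."""
    schema_version: int
    task_id: str
    sender: str
    status: str
    payload: dict


def parse_handoff(raw: str) -> HandoffMessage:
    """Parse and validate a JSON handoff; raise ValueError if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("handoff must be a JSON object")
    missing = [name for name in REQUIRED if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["schema_version"] != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {data['schema_version']}")
    if data["status"] not in ALLOWED_STATUS:
        raise ValueError(f"unknown status: {data['status']!r}")
    if not isinstance(data["payload"], dict):
        raise ValueError("payload must be an object")
    return HandoffMessage(**{name: data[name] for name in REQUIRED})
```

A free-text reply such as "please handle this" is rejected at the boundary with a specific error, and the `schema_version` field is what makes the exchange versionable.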

“Wait for a stronger model” doesn’t help, because a stronger model still cannot tell that its premise is wrong, that the context is stale, or that the cycle isn’t converging. These are properties of the system, not the model.

Engineering model: how to avoid each failure

Each failure has a concrete architectural countermeasure.

Against looping — hard iteration limits, timeouts and early exit on low confidence with escalation to a human. A cycle that doesn’t converge must stop, not burn tokens.
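The loop guard described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Attempt` and `Outcome` types, the `run_bounded` name and the threshold values are hypothetical, and the step is assumed to report its own confidence.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    answer: Optional[str]
    confidence: float  # 0.0..1.0, assumed to be reported by the step itself


@dataclass
class Outcome:
    status: str  # "done" or "escalated"
    answer: Optional[str]
    reason: str
    iterations: int


def run_bounded(step: Callable[[int], Attempt],
                max_iterations: int = 3,
                timeout_s: float = 30.0,
                accept_at: float = 0.8,
                give_up_below: float = 0.3,
                clock: Callable[[], float] = time.monotonic) -> Outcome:
    """Run a step with a hard iteration limit, a deadline and an early exit."""
    deadline = clock() + timeout_s
    for i in range(1, max_iterations + 1):
        # The deadline is checked between iterations; it does not
        # interrupt a step that is already running.
        if clock() > deadline:
            return Outcome("escalated", None, "timeout", i - 1)
        attempt = step(i)
        if attempt.answer is not None and attempt.confidence >= accept_at:
            return Outcome("done", attempt.answer, "accepted", i)
        if attempt.confidence < give_up_below:
            # Early exit: more retries are unlikely to help, hand off to a human.
            return Outcome("escalated", None, "low confidence", i)
    return Outcome("escalated", None, "iteration limit", max_iterations)
```

Every exit path other than an accepted answer ends in `"escalated"`, so a cycle that does not converge stops after a bounded number of paid iterations.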

Against context drift — a context budget per step: the model is fed a reranked minimum, not “everything accumulated”.
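A per-step context budget might look like the sketch below. The word-count tokenizer and the word-overlap scorer are deliberately crude stand-ins chosen for this example; a real system would plug in its own tokenizer and reranker. All names are hypothetical.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def overlap_score(query: str, chunk: str) -> float:
    """Toy reranker: share of query words that also occur in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


def build_context(query: str,
                  chunks: List[str],
                  budget_tokens: int,
                  score: Callable[[str, str], float] = overlap_score) -> List[str]:
    """Pick the highest-scoring chunks that fit into the token budget."""
    scored = sorted(((score(query, ch), ch) for ch in chunks),
                    key=lambda pair: pair[0], reverse=True)
    selected: List[str] = []
    used = 0
    for relevance, chunk in scored:
        if relevance <= 0.0:
            break  # the rest is irrelevant; never pad the context with it
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # too large for the remaining budget, try a smaller one
        selected.append(chunk)
        used += cost
    return selected
```

The budget is a hard cap, so the context size per step stays bounded no matter how long the inter-agent history grows.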

Against divergence — fixed process state and idempotent steps: a repeat doesn’t change the result, the route is reproducible.
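One way to read "fixed state and idempotent steps" is that each step result is stored under a key derived from the step name and its input, so a repeat returns the stored result instead of recomputing it. The sketch below assumes that reading. `ProcessState` is a hypothetical name, and the in-memory dict stands in for a durable store.

```python
import hashlib
import json
from typing import Callable, Dict


class ProcessState:
    """Keeps step results keyed by step name and canonicalized input.

    An in-memory dict is used here for brevity (an assumption of this
    sketch); in production the store would need to be durable.
    """

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}

    @staticmethod
    def key(step_name: str, step_input: dict) -> str:
        # Canonical JSON: the same input always hashes to the same key,
        # regardless of dict key order.
        canonical = json.dumps(step_input, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(f"{step_name}:{canonical}".encode("utf-8")).hexdigest()

    def run_step(self, step_name: str, step_input: dict,
                 fn: Callable[[dict], dict]) -> dict:
        """Run fn once per (step, input); repeats return the stored result."""
        k = self.key(step_name, step_input)
        if k not in self._results:
            self._results[k] = fn(step_input)
        return self._results[k]
```

With this shape, a retry after a crash or a duplicate message does not trigger a second model call, and the same input always follows the same route.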

Against acting on a wrong assumption — contracts with a mandatory “unsure” field and a rule: on low confidence the step does not silently proceed but clarifies or hands off to a human.
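A contract with a mandatory “unsure” field can be sketched as a result type that cannot be constructed without it, plus a routing rule. The names `StepResult` and `route` and the 0.75 threshold are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StepResult:
    value: str
    confidence: float   # 0.0..1.0
    unsure: List[str]   # mandatory: no default, so it cannot be omitted

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


def route(result: StepResult, min_confidence: float = 0.75) -> str:
    """Decide what happens next; the step never proceeds silently when unsure."""
    if result.unsure:
        return "clarify"            # open questions go back to the requester
    if result.confidence < min_confidence:
        return "handoff_to_human"
    return "proceed"
```

Because `unsure` has no default value, an agent that produces a result must state explicitly that it has no open doubts by passing an empty list.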

On top of everything — observability: every step tagged, with result, cost, latency and error rate. Without it none of these failures is visible until the bill or the incident.
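A minimal sketch of such step-level tracing is shown below. It assumes each step can report its own cost alongside its result. The `Tracer` class and its field names are hypothetical, and a real deployment would more likely export these records to an existing tracing backend.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class StepRecord:
    step: str            # tag identifying the step
    ok: bool
    latency_s: float
    cost_usd: float
    error: Optional[str] = None


class Tracer:
    def __init__(self) -> None:
        self.records: List[StepRecord] = []

    def run(self, step: str, fn: Callable[[], Tuple[object, float]]) -> object:
        """Run fn, which returns (result, cost_usd), and record the outcome."""
        start = time.perf_counter()
        try:
            result, cost = fn()
        except Exception as exc:
            elapsed = time.perf_counter() - start
            # Cost of a failed call is unknown in this sketch, so 0.0 is recorded.
            self.records.append(StepRecord(step, False, elapsed, 0.0, repr(exc)))
            raise
        elapsed = time.perf_counter() - start
        self.records.append(StepRecord(step, True, elapsed, cost))
        return result

    def summary(self, step: str) -> Dict[str, float]:
        """Aggregate runs, error rate, total cost and worst latency for a step."""
        rows = [r for r in self.records if r.step == step]
        n = len(rows)
        return {
            "runs": n,
            "error_rate": sum(not r.ok for r in rows) / n if n else 0.0,
            "total_cost_usd": sum(r.cost_usd for r in rows),
            "max_latency_s": max((r.latency_s for r in rows), default=0.0),
        }
```

Failures are recorded and then re-raised, so tracing never hides an error from the caller.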

Practical takeaway for business

Failures are predictable, so they can be turned into a pilot checklist. Ask the contractor five questions: what stops looping; how context per step is limited; whether state is persisted; what the system does on low confidence; and how all of this shows up in tracing. Without concrete answers to all five, the system is not ready for production load, however convincing the demo.

Agentic AI: adoption surges, outcomes lag

Enterprise apps with agentic AI, 2024: 1%
Forecast for 2028: 33%
Agentic AI projects to be scrapped by end of 2027: 40%

Adoption is forecast to grow by more than an order of magnitude, yet roughly 40% of projects are expected to be scrapped before they reach an outcome. The gap between a demo and a working system is closed by architecture, not by the model.

Source: Gartner, 2024–2025 https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027

This is why roughly 40% of agentic projects are forecast to be scrapped: not because “the technology doesn’t work”, but because architectural defects were not closed before scaling. The signs are visible on the pilot, and that is where to look before investing in a rollout.


Open questions

Where the reasonable limit of autonomy lies depends on the error cost in the process and is set deliberately, not by default. How to measure the reliability of a distributed process before deployment has no mature industry standard; we rely on reproducibility on historical data and the share of cases driven to a result without escalation. How much verification is enough is an open trade-off: excessive verification raises the cost of every step, insufficient lets expensive errors through.
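The two measures mentioned above, reproducibility on historical data and the share of cases completed without escalation, could be computed by replaying past cases. The sketch below is an assumption about how that might be done, not an industry standard: the `replay_metrics` name, the outcome dict shape with a `"status"` key, and the choice of three repeats are all hypothetical.

```python
from typing import Callable, Dict, Iterable


def replay_metrics(cases: Iterable[object],
                   run: Callable[[object], dict],
                   repeats: int = 3) -> Dict[str, float]:
    """Replay each historical case several times and aggregate two shares.

    run(case) is assumed to return a dict with at least a "status" key
    whose value is "done" or "escalated".
    """
    total = completed = reproducible = 0
    for case in cases:
        outcomes = [run(case) for _ in range(repeats)]
        total += 1
        # Completed: every repeat reached a result without escalation.
        if all(o["status"] == "done" for o in outcomes):
            completed += 1
        # Reproducible: every repeat produced exactly the same outcome.
        if all(o == outcomes[0] for o in outcomes):
            reproducible += 1
    if total == 0:
        raise ValueError("no cases to replay")
    return {
        "completed_without_escalation": completed / total,
        "reproducible": reproducible / total,
    }
```

The two shares are reported separately on purpose: a case that escalates every time is reproducible but not completed, and a case that completes with a different answer each time is completed but not reproducible.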


If you have a process where several roles coordinate work and the error cost is high, the typical failures can be closed architecturally from the start. We’ll work out the risk profile and where the handoff boundaries to humans run.
