engineering notes

Event-driven AI systems instead of simple scenarios

Why a linear scenario breaks on exceptions while an event-driven architecture makes an AI system robust and observable.

In brief for executives. A linear scenario “step 1 → step 2 → step 3” looks good in a demo and is fragile in production, because reality is not linear: events arrive out of order, services respond late, and some steps repeat. An event-driven architecture buys operational reliability: fewer incidents, predictable behaviour under load, and observability. It is an engineering decision with a direct effect on operating cost.

Most AI automations are drawn as a straight line: a request comes in, the system does A, then B, then C, and returns the result. In a demo this works. In production, events arrive out of order and several at a time, and the line snaps.

Reality is not linear — and systems shouldn’t be either.

Hypothesis: real processes are event-driven, not linear

In a real process, several things happen at once: new data arrives mid-processing; an external service responds later than the step that was waiting for it; the same signal arrives twice. These are not exceptions to handle “later”; they are the normal behaviour of the flow. A system designed as a line fights the very nature of the task.

Problem: a linear scenario depends on the “right order”

A linear pipeline implicitly assumes that everything happens in order and exactly once. A real flow breaks this assumption constantly: races, repeats, late responses. The line responds with hangs, duplicated actions and silent losses, and it gets patched with an endless series of “ifs” on top of the original scenario.

data

Why multi-agent systems fail (1,600+ execution traces)

Nearly 80% of failures come from specification and coordination, i.e. from architecture, not from a “weak model”. They are fixed by contracts and explicit coordination, not by swapping the LLM.

Source: Why Do Multi-Agent LLM Systems Fail? (MAST, UC Berkeley), NeurIPS 2025 https://arxiv.org/pdf/2503.13657

Most failures of multi-step AI systems are coordination and state breakdowns, which is precisely what a linear scenario does not model.

Why the usual approaches don’t work

“Add handling for this case” on top of the line doesn’t scale: each new “if” complicates the scenario and spawns new races.

“Add a retry on error” without idempotency leads to duplicated actions: a repeated step performs its action twice.

“Wait for the response synchronously” turns an external service’s delay into a hang of the whole process.
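The retry problem above can be reproduced in a few lines. This is a minimal sketch, not a real payment API: `PaymentService`, `charge`, `with_retry` and the `idempotency_key` parameter are all hypothetical names invented for the illustration. The toy service performs the action and then loses its first reply, which is the case where a naive retry doubles the side effect.

```python
class PaymentService:
    """Toy external service (hypothetical): it performs the charge,
    but the reply to the first call is lost."""

    def __init__(self):
        self.charges = []        # side effects that really happened
        self.seen_keys = set()   # idempotency keys already applied
        self._calls = 0

    def charge(self, order_id, idempotency_key=None):
        self._calls += 1
        # Apply the action unless this key has already been applied.
        if idempotency_key is None or idempotency_key not in self.seen_keys:
            self.charges.append(order_id)
            if idempotency_key is not None:
                self.seen_keys.add(idempotency_key)
        if self._calls == 1:
            # The action succeeded, but the caller never learns about it.
            raise TimeoutError("reply lost")
        return "ok"


def with_retry(call, attempts=3):
    """Naive retry: repeat the call until it stops raising TimeoutError."""
    last_error = None
    for _ in range(attempts):
        try:
            return call()
        except TimeoutError as exc:
            last_error = exc
    raise last_error


# Retry without idempotency: the same order is charged twice.
naive = PaymentService()
with_retry(lambda: naive.charge("order-1"))
naive_charges = len(naive.charges)

# Retry with an idempotency key: the repeat is recognised and skipped.
safe = PaymentService()
with_retry(lambda: safe.charge("order-1", idempotency_key="order-1:charge"))
safe_charges = len(safe.charges)

print(naive_charges, safe_charges)
```

The retry loop is identical in both runs. The only difference is whether the receiving side can recognise a repeat, which is why a retry is safe only together with idempotency.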

Engineering model: events, state, idempotency

An event as the unit. A step doesn’t “call the next” but emits an event; whoever must react is subscribed to it. Order and parallelism stop breaking the process.

Process state. A task has explicit state: what has happened and what is done. A late or repeated event is handled correctly because there is something to reconcile it against.

Idempotency. Repeating a step doesn’t change the result or double the action. This is what makes repeats safe, and therefore permissible.

Escalation by event. Low confidence, a timeout, a contradiction: these are events that the handoff to a human subscribes to, not an “if” branch at the end of a function.

Flow observability. Events, their order, delays and handling are visible. An incident turns from “hung somewhere” into “event X not handled by subscriber Y”.
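The five elements above can be sketched together in one small in-process example. It is an illustration under stated assumptions, not a production design: the `Bus` class, the event kinds (`document.received`, `confidence.low`, `document.archived`), the handler names and the 0.7 confidence threshold are all hypothetical, and a real system would typically put a durable broker and a persistent store behind the same ideas.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Event:
    event_id: str   # unique per logical event; a redelivery reuses the same id
    kind: str       # hypothetical kinds, e.g. "document.received"
    task_id: str
    payload: dict = field(default_factory=dict)


class Bus:
    def __init__(self):
        self.subscribers = defaultdict(list)  # kind -> list of handlers
        self.handled = set()                  # (event_id, handler name) already done
        self.log = []                         # observability trail

    def subscribe(self, kind, handler):
        self.subscribers[kind].append(handler)

    def emit(self, event):
        handlers = self.subscribers.get(event.kind, [])
        if not handlers:
            self.log.append((event.kind, event.event_id, None, "unhandled"))
        for handler in handlers:
            key = (event.event_id, handler.__name__)
            if key in self.handled:
                # Idempotency: a repeated event does not repeat the action.
                self.log.append((event.kind, event.event_id, handler.__name__, "duplicate-skipped"))
                continue
            handler(event)
            self.handled.add(key)
            self.log.append((event.kind, event.event_id, handler.__name__, "handled"))


bus = Bus()
tasks = defaultdict(lambda: {"status": "new", "answer": None})  # explicit process state
escalations = []


def classify(event):
    state = tasks[event.task_id]
    confidence = event.payload["confidence"]
    if confidence < 0.7:  # assumed threshold, for illustration only
        state["status"] = "needs_review"
        # Escalation is itself an event, not an "if" branch that calls a human.
        bus.emit(Event(f"{event.event_id}:low", "confidence.low", event.task_id,
                       {"confidence": confidence}))
    else:
        state["status"] = "done"
        state["answer"] = event.payload["label"]


def hand_off_to_human(event):
    escalations.append(event.task_id)


bus.subscribe("document.received", classify)
bus.subscribe("confidence.low", hand_off_to_human)

bus.emit(Event("e1", "document.received", "t1", {"label": "invoice", "confidence": 0.95}))
bus.emit(Event("e2", "document.received", "t2", {"label": "contract", "confidence": 0.40}))
bus.emit(Event("e2", "document.received", "t2", {"label": "contract", "confidence": 0.40}))  # redelivery
bus.emit(Event("e3", "document.archived", "t1"))  # nobody is subscribed

for entry in bus.log:
    print(entry)
```

The log is what turns “hung somewhere” into a concrete statement: each entry names the event, the subscriber and the outcome, including the case where no subscriber exists at all.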

Practical takeaway for business

Event orientation determines the number of incidents and the cost of operation. A linear system requires constant manual fixing of exceptions; an event-driven one brings exceptions inside the model, so the system stabilizes.

data

Share of AI pilots that reach production

Per IDC, of 33 launched pilots only about 4 reach production. The cause of failure is not the technology; it is the underestimated complexity of embedding it into a real process.

Source: IDC, 2025 (via CIO.com) https://www.cio.com/article/3850763/88-of-ai-pilots-fail-to-reach-production-but-thats-not-all-on-it.html

Part of the reason pilots don’t reach production lies exactly here: a linear demo doesn’t withstand the real, unordered flow, and rewriting it into an event-driven system “later” costs more than designing it that way from the start.

Ask how the system behaves on a late event and on a repeated one. If the answer is “that won’t happen”, it will, and learning that in production costs more than designing for it in advance.
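The question can be turned into a small executable check. The sketch below is one possible shape for it, under assumptions of its own: the stage names in `STAGES`, the `apply_event` function and the rule “never move backwards” are hypothetical and chosen only to make the late and repeated cases visible.

```python
STAGES = ["received", "enriched", "answered"]  # hypothetical stage order


def apply_event(state, event):
    """Return (new_state, outcome). A task never moves backwards
    and never applies the same event id twice."""
    if event["id"] in state["seen"]:
        return state, "repeat"
    seen = state["seen"] | {event["id"]}
    if STAGES.index(event["stage"]) <= STAGES.index(state["stage"]):
        # Late event: record that it was seen, but keep the current stage.
        return {"stage": state["stage"], "seen": seen}, "late"
    return {"stage": event["stage"], "seen": seen}, "applied"


state = {"stage": "received", "seen": frozenset()}
outcomes = []
for event in [
    {"id": "e2", "stage": "answered"},  # arrives first
    {"id": "e1", "stage": "enriched"},  # late: the task is already answered
    {"id": "e2", "stage": "answered"},  # repeat of an event already applied
]:
    state, outcome = apply_event(state, event)
    outcomes.append(outcome)

print(outcomes, state["stage"])
```

Whether a late event should be ignored, merged or escalated depends on the process; the point of the check is that the system gives a defined answer for each case instead of an undefined one.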


Open questions

Where does the limit of justified event orientation lie? For simple, rare processes a line is enough; the question is one of honest assessment, not of architectural fashion.

Event-driven systems are harder to debug without good observability. That is the built-in price of the approach.

How finely to split a process into events is a trade-off between flexibility and complexity.

If your automation constantly requires manual fixing of “strange cases”, that is a symptom of a linear architecture. We’ll look at the event flow and where it snaps.