engineering notes

Best approaches to AI agents for business: how to measure "best"

How to measure the "best" AI agent for business: reliability, cost of ownership, human control and embeddability — not the model.

In brief for executives. The “best AI agent for business” is not the one built on the top model or the one that looks better in a demo. It is the one that embeds reliably into your process, with a predictable cost of ownership and human control over expensive decisions. Comparing agents by model measures the wrong thing; the criterion is behaviour in your process, not in a demonstration.

The request “recommend the best AI agent” assumes there is a ranking. There isn’t one: “best” depends on the process, and it is measured by things other than what is usually compared. Let’s go through what to actually measure.

The “best” agent is the one that embeds into your process, not the one with the newer model.

Hypothesis: “best” is about embeddability and cost of ownership

An agent creates value when it reliably does the work in your process and has a predictable operating cost. The model behind it is a swappable component. So “best” is determined not by the model but by how the agent behaves on exceptions, how much it costs to own and how it is embedded into your systems.

Data. Generative-AI pilots: share with rapid revenue growth

Most pilots “answer” but produce no impact, because what gets built is an interface, not a process with state, contracts and control.

Source: MIT, 2025 report (via Fortune) https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/

Most pilots “work” in a demo and produce no effect. So the thing to compare is not the demo but the ability to reach a result in production.

Problem: people compare by model and demo

The typical comparison: whose model is newer, whose answer is prettier on the shown example, who has more “capabilities”. None of this predicts behaviour in your process: on exceptions, on an external-system failure, on an input-format change. A demo measures the best case; the business cares about the worst.

Why the usual approaches don’t work

“Take the strongest model” doesn’t work: model strength is not process reliability, and this choice often maximises cost of ownership.

“Take a ready-made agent with more features” doesn’t work: a ready-made agent doesn’t know your process, its exceptions and error cost.

“Compare by demos” doesn’t work: a demo is the best case, while cost and risk live in the worst.

Data. Why multi-agent systems fail (1,600+ execution traces)

Nearly 80% of failures come from specification and coordination, i.e. architecture, not a “weak model”. They are fixed by contracts and explicit coordination, not by swapping the LLM.

Source: Why Do Multi-Agent LLM Systems Fail? (MAST, UC Berkeley), NeurIPS 2025 https://arxiv.org/pdf/2503.13657

Reliability is set by how well specification and coordination gaps are closed, not by whose model is underneath.

Engineering model: how to measure “best”

Reliability on exceptions. What the agent does when data is missing, the answer is ambiguous, or an external service is down. Worst-case behaviour is the main criterion.

Cost of ownership. Token spend at your volume, operations, and support when the process changes. “Best” means predictable in cost, not cheap at the start.

Human control. Whether there is a designed handoff to a human on expensive decisions, and whether the decision path is visible. Without it a “smart agent” is an unmanaged risk.

Embeddability. How cleanly the agent fits into your systems and processes, whether it survives a model swap behind a contract.

Observability. Whether you can see what the agent did, why it did it, and what it cost. Without it you can neither compare nor manage.
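
To make these criteria concrete, here is a minimal Python sketch, not a reference implementation. It assumes the model is hidden behind a plain callable contract, so it can be swapped without touching the process, and that every exception path ends in an explicit handoff to a human with a recorded reason. All names here (Agent, Decision, steady_model, flaky_model), the "amount" field and the thresholds are hypothetical and chosen only for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Contract (assumption): a model is any callable that takes a case and
# returns (answer, confidence, cost_usd). Swapping models means swapping this.
Model = Callable[[dict], Tuple[str, float, float]]


@dataclass
class Decision:
    action: str              # "auto" or "escalate"
    answer: Optional[str]    # set only when the agent acted on its own
    reason: str              # why the agent acted or handed off
    cost_usd: float          # what this case cost
    elapsed_s: float         # how long it took


@dataclass
class Agent:
    model: Model
    min_confidence: float = 0.8       # hypothetical threshold
    max_auto_amount: float = 1000.0   # hypothetical "expensive decision" limit
    trace: List[Decision] = field(default_factory=list)

    def handle(self, case: dict) -> Decision:
        started = time.perf_counter()
        answer: Optional[str] = None
        cost = 0.0
        if "amount" not in case:
            action, reason = "escalate", "missing data: amount"
        elif case["amount"] > self.max_auto_amount:
            action, reason = "escalate", "expensive decision"
        else:
            try:
                answer, confidence, cost = self.model(case)
            except Exception as exc:  # external service down, timeout, etc.
                action, reason = "escalate", f"model or service failure: {exc}"
            else:
                if confidence < self.min_confidence:
                    action, reason = "escalate", "ambiguous answer"
                else:
                    action, reason = "auto", "within limits"
        decision = Decision(
            action=action,
            answer=answer if action == "auto" else None,
            reason=reason,
            cost_usd=cost,
            elapsed_s=time.perf_counter() - started,
        )
        self.trace.append(decision)  # observability: every decision is recorded
        return decision


def steady_model(case: dict) -> Tuple[str, float, float]:
    """Hypothetical stand-in for a real model call."""
    return ("approve", 0.95, 0.002)


def flaky_model(case: dict) -> Tuple[str, float, float]:
    """Hypothetical stand-in for a model whose upstream service is down."""
    raise TimeoutError("upstream unavailable")


agent = Agent(model=steady_model)
print(agent.handle({"amount": 120}).action)    # auto
print(agent.handle({"amount": 5000}).reason)   # expensive decision
print(agent.handle({}).reason)                 # missing data: amount
agent.model = flaky_model                      # model swap behind the contract
print(agent.handle({"amount": 120}).action)    # escalate
```

The point of the sketch is that the four failure paths are designed in advance and each leaves a trace entry; which model sits behind the callable does not change that behaviour.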

Practical takeaway for business

Compare by a checklist, not by the model: behaviour on exceptions, cost of ownership, human control, embeddability, observability. These are the questions to ask a contractor before the start and to verify during the pilot.
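
One possible way to turn the checklist into a comparison is sketched below in Python. It assumes each candidate gets a 0 to 5 score per criterion from the pilot, and that candidates are compared by their weakest criterion, in line with the worst-case view above. The scoring scale, the "weakest criterion" rule and the vendor names and numbers are illustrative assumptions, not measured data.

```python
from typing import Dict, List, Tuple

# The five checklist criteria from the text. The model name is deliberately absent.
CRITERIA = (
    "exceptions",
    "cost_of_ownership",
    "human_control",
    "embeddability",
    "observability",
)


def weakest_criterion(scores: Dict[str, int]) -> Tuple[str, int]:
    """Return the lowest-scoring criterion and its score (0-5 scale, assumed)."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        # An unanswered checklist item is itself a finding, not a zero.
        raise ValueError(f"checklist incomplete, missing: {missing}")
    name = min(CRITERIA, key=lambda c: scores[c])
    return name, scores[name]


def compare(candidates: Dict[str, Dict[str, int]]) -> List[Tuple[str, str, int]]:
    """Order candidates by their weakest criterion, strongest floor first."""
    rows = [(name, *weakest_criterion(scores)) for name, scores in candidates.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical pilot scores for two hypothetical candidates.
pilot_scores = {
    "vendor_a": {"exceptions": 2, "cost_of_ownership": 4, "human_control": 3,
                 "embeddability": 4, "observability": 5},
    "vendor_b": {"exceptions": 4, "cost_of_ownership": 3, "human_control": 4,
                 "embeddability": 3, "observability": 3},
}

for name, criterion, score in compare(pilot_scores):
    print(f"{name}: weakest on {criterion} ({score}/5)")
```

In this made-up example vendor_a has the higher peak scores but the lower floor, so it comes second. Whether a floor-based rule is right for you depends on your error cost; a weighted sum is an equally valid assumption.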

“Best” is the one that embeds into your process with a predictable cost, not the one with a newer model. A contractor who sells the model, not the architecture, is answering the wrong question.


Open questions

How to measure reliability before deployment: we rely on reproducibility on historical data and the share of cases resolved without escalation; there is no mature standard yet. Where the agent’s autonomy boundary lies: it is set by the process’s error cost. How to compare the cost of ownership of different solutions honestly: only by a projection onto your volume, not by a price list for “an agent”.
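
As a rough illustration of the two measurements mentioned above, the Python sketch below computes reproducibility and the share of cases without escalation from repeated runs over historical cases, and projects a monthly cost onto a given volume. The "ESCALATE" marker, the function names and the cost components (tokens, human review of escalated cases, a fixed operations and support amount) are assumptions made for the example, and all numbers are invented.

```python
from typing import Dict, List


def replay_metrics(runs_per_case: List[List[str]]) -> Dict[str, float]:
    """runs_per_case[i] holds the outcomes of repeated runs on historical case i.

    The outcome "ESCALATE" (an assumed marker) means the agent handed off to a human.
    """
    total = len(runs_per_case)
    if total == 0:
        raise ValueError("no historical cases to replay")
    # A case is reproducible if every repeated run gave the same outcome.
    reproducible = sum(1 for runs in runs_per_case if len(set(runs)) == 1)
    # A case counts as "no escalation" only if no run escalated.
    no_escalation = sum(1 for runs in runs_per_case if "ESCALATE" not in runs)
    return {
        "reproducibility": reproducible / total,
        "no_escalation_share": no_escalation / total,
    }


def monthly_cost(
    cases_per_month: int,
    tokens_per_case: float,
    usd_per_1k_tokens: float,
    escalation_share: float,
    usd_per_human_review: float,
    fixed_ops_usd: float,
) -> float:
    """Project cost of ownership onto your volume (cost model is an assumption)."""
    token_cost = cases_per_month * tokens_per_case / 1000 * usd_per_1k_tokens
    review_cost = cases_per_month * escalation_share * usd_per_human_review
    return token_cost + review_cost + fixed_ops_usd


# Invented replay log: four historical cases, three runs each.
history = [
    ["approve", "approve", "approve"],
    ["approve", "reject", "approve"],        # not reproducible
    ["ESCALATE", "ESCALATE", "ESCALATE"],    # reproducible, but escalated
    ["reject", "reject", "reject"],
]
metrics = replay_metrics(history)
print(metrics)

projected = monthly_cost(
    cases_per_month=10_000,
    tokens_per_case=4_000,
    usd_per_1k_tokens=0.002,
    escalation_share=1 - metrics["no_escalation_share"],
    usd_per_human_review=1.5,
    fixed_ops_usd=500,
)
print(round(projected, 2))
```

Feeding the measured escalation share into the cost projection is the design choice worth noting: with numbers like these, human review of escalated cases can dominate token spend, which a per-agent price list would not show.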

If you are choosing between solutions, we’ll compare them by behaviour in your process, not by models, and assemble a checklist for your error cost and volume.