
Why AI automation can suddenly become expensive

Where uncontrolled cost growth in AI automation comes from (context length, retries, bad routing) and how to stay within budget.

In brief for executives. The AI-automation bill does not jump all at once; it grows imperceptibly, step by step, and then arrives at several times the forecast. The paradox: model prices collapse while bills explode. The cause is not the token price but the architecture: context length, repeat calls, silent loops. Cost control has to be designed in before launch; “optimizing by the bill” almost always ends in a temporary feature shutdown.

The scenario repeats project after project. The prototype is cheap, everyone likes it, and it goes live. Two months later the bill is several times the forecast, and an argument starts over whether to turn off some features until the cause is found. This is an engineering problem, not a financial one.

The token gets cheaper. The habit of pouring everything into context does not.

Hypothesis: cost grows step by step, not all at once

AI-automation cost is a sum over process steps. Each step consists of model calls, and each call consumes context tokens. Growth comes by accumulation rather than in one jump: slightly longer context, a few more calls, a few more retries, and the total ends up several times above the forecast.
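One way to write this sum down, as a sketch rather than an exact billing formula. The symbols are assumptions introduced here for illustration: S is the number of steps, n_s the model calls made at step s per result, r_s the average number of retries per call, t_s the input and output tokens per call, and p_s the per-token prices of the model used at that step.

```latex
C_{\text{result}} = \sum_{s=1}^{S} n_s \,(1 + r_s)\,
  \bigl( t^{\text{in}}_s \, p^{\text{in}}_s + t^{\text{out}}_s \, p^{\text{out}}_s \bigr)
```

Every factor is multiplicative within a step, which is why modest growth in context, calls, and retries at the same time compounds into a bill several times larger.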

Data: model inference price falls year over year

9–900× per year: the drop in inference price, depending on the task.

$20 → $0.07 per 1M tokens: the price of GPT-3.5-level performance, over about 18 months.

The per-token price is collapsing, but that does not remove the need to manage cost: what gets cheaper is the token, not the habit of stuffing everything into context and calling the model needlessly.

Source: Stanford HAI, AI Index Report 2025 https://hai.stanford.edu/ai-index/2025-ai-index-report

The per-token price falls many times over each year, and that breeds complacency: “it’ll get cheaper by itself”. It won’t, because volume grows faster than the price falls.

Problem: the budget is planned from the prototype

A prototype handles dozens of requests with short context and almost no retries. Fed with that data, the “average length × price × number of requests” formula gives an understated figure, because production adds what the prototype never showed: growing context, several model calls per result, retries and silent loops.
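A minimal sketch of that gap. All numbers, the flat price, and the three multipliers (`context_growth`, `calls_per_result`, `retry_rate`) are hypothetical and chosen only to show how the prototype formula and a production-aware one diverge.

```python
# All numbers below are hypothetical and chosen only to illustrate the gap.
PRICE_PER_1M_TOKENS = 0.50  # USD, assumed flat price for input and output


def naive_estimate(avg_tokens: int, requests: int) -> float:
    """Prototype formula: average length x price x number of requests."""
    return avg_tokens * requests * PRICE_PER_1M_TOKENS / 1_000_000


def production_estimate(avg_tokens: int, requests: int, context_growth: float,
                        calls_per_result: float, retry_rate: float) -> float:
    """Same formula with the three factors the prototype did not show."""
    tokens_per_call = avg_tokens * context_growth
    paid_calls = requests * calls_per_result * (1 + retry_rate)
    return tokens_per_call * paid_calls * PRICE_PER_1M_TOKENS / 1_000_000


naive = naive_estimate(avg_tokens=2_000, requests=100_000)
real = production_estimate(avg_tokens=2_000, requests=100_000,
                           context_growth=3.0, calls_per_result=4,
                           retry_rate=0.25)
print(f"naive: ${naive:.2f}, production: ${real:.2f}, ratio: {real / naive:.1f}x")
```

None of the three multipliers looks alarming on its own, yet together they move the estimate by an order of magnitude. That is the reason to forecast a range rather than a single figure.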

Why the usual approaches don’t work

“Wait, models get cheaper” doesn’t work: the token gets cheaper, not the habit of stuffing everything into context; volume grows faster than price.

“Optimize when it gets expensive” doesn’t work: by then the architecture is fixed, and the only fast lever is to cut features.

“Put a cheaper model everywhere” doesn’t work: on steps where an error is costly, a weak model causes losses larger than the savings.

Data: expected vs realized ROI of agentic AI

171%: average expected ROI of agentic AI in organization surveys.

<1%: share of executives who report significant ROI (≥20% to profit or savings).

$1.41: average return per $1 invested (savings plus revenue growth).

Expectations run far above realized impact. An honest ROI is computed for a specific process and its full cost of ownership, not from a 171% expectation.

Source: Deloitte, AI ROI, 2025 https://www.deloitte.com/global/en/issues/generative-ai/ai-roi-the-paradox-of-rising-investment-and-elusive-returns.html

Much of the gap between expectations and reality comes from this: the benefit was counted from the prototype, while the cost of ownership grew imperceptibly.

Engineering model: where cost grows and how to hold it

Context length. This is the main silent driver. The fix is a context budget per step: the model receives a reranked minimum, not “everything found”.
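The context budget described above can be sketched as follows. This is an illustration under assumptions, not a reference implementation: `Chunk`, `fit_to_budget`, and the whitespace-based `count_tokens` are hypothetical names, and a real system would count tokens with the model's own tokenizer and take scores from its own reranker.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from a reranker (assumed to exist upstream)


def count_tokens(text: str) -> int:
    # Rough stand-in: a real system would use the model's own tokenizer.
    return len(text.split())


def fit_to_budget(chunks: list[Chunk], budget_tokens: int) -> list[Chunk]:
    """Keep the highest-scoring chunks that fit into the step's token budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = count_tokens(chunk.text)
        if used + cost > budget_tokens:
            continue  # does not fit; a smaller, lower-ranked chunk still might
        selected.append(chunk)
        used += cost
    return selected


found = [
    Chunk("refund policy for annual plans " * 10, score=0.91),
    Chunk("general company history " * 40, score=0.35),
    Chunk("refund deadlines and exceptions " * 5, score=0.88),
]
context = fit_to_budget(found, budget_tokens=80)
print([c.score for c in context])
```

The budget is a property of the step, not of the retrieval result: however much the search returns, the step never pays for more than its limit.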

Number of calls per result. One visible answer often hides four calls: classification, rephrasing, the answer itself, and a check. The fix is per-step model routing: routine steps go to a cheap model, expensive decisions to a strong one.
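Per-step routing can be as simple as a lookup table. In the sketch below the model names are placeholders, and the assignment of steps to tiers is an assumption for illustration; which steps count as routine depends on the process.

```python
# Model names and the step-to-tier mapping are placeholders for illustration.
CHEAP_MODEL = "small-model"
STRONG_MODEL = "large-model"

ROUTES = {
    "classify": CHEAP_MODEL,
    "rephrase": CHEAP_MODEL,
    "answer": STRONG_MODEL,
    "check": CHEAP_MODEL,
}


def pick_model(step: str, high_error_cost: bool = False) -> str:
    """Return the model for a step; costly-to-get-wrong steps always go strong."""
    if high_error_cost:
        return STRONG_MODEL
    # Unknown steps default to the strong model: safer, and visible in the bill.
    return ROUTES.get(step, STRONG_MODEL)


for step in ("classify", "rephrase", "answer", "check"):
    print(step, "->", pick_model(step))
```

The `high_error_cost` override reflects the point made earlier: where a mistake is expensive, saving on the model is a false economy.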

Retries and silent loops. Timeouts, failed parsing, and looping are all paid-for tokens with no result. The fix is an iteration limit and an early exit with escalation.
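A minimal sketch of an iteration limit with escalation, assuming the step expects JSON output. `run_with_limit`, `Escalation`, and the `call_model` callable are hypothetical names; the real call and the escalation target (a human queue, a fallback path) depend on the system.

```python
import json


class Escalation(Exception):
    """Raised when the step gives up and hands the case to a human or fallback."""


def run_with_limit(call_model, prompt: str, max_attempts: int = 3):
    """Call the model until the output parses as JSON, at most max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc  # paid-for tokens with no result; try again
    raise Escalation(f"no valid output after {max_attempts} attempts: {last_error}")


# Fake model for the demo: fails twice, then returns valid JSON.
replies = iter(["not json", "{broken", '{"status": "ok"}'])
result = run_with_limit(lambda prompt: next(replies), "classify this ticket")
print(result)
```

The important property is that the loop has a hard ceiling: the worst case costs `max_attempts` calls and ends in a visible escalation, not in an unbounded silent retry.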

Steps without a model. Logic and data handling are done in code; the model is used only where it is needed. The cheapest token is the one you never send.

Observability by step. Cost is collected per step, so “it’s expensive” turns into “step X is expensive because of Y”. That is a task with a solution, not a reason to cut blindly.
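Per-step cost collection needs very little machinery. The sketch below is illustrative: `CostLedger` is a hypothetical in-memory class with a single assumed price, whereas a production setup would more likely attach these numbers to its existing tracing or metrics system.

```python
from collections import defaultdict


class CostLedger:
    """Accumulates token spend per process step."""

    def __init__(self, price_per_1m_tokens: float):
        self.price = price_per_1m_tokens
        self.tokens = defaultdict(int)
        self.calls = defaultdict(int)

    def record(self, step: str, input_tokens: int, output_tokens: int) -> None:
        self.tokens[step] += input_tokens + output_tokens
        self.calls[step] += 1

    def report(self) -> list[tuple[str, int, float]]:
        """Return (step, calls, cost in USD), most expensive step first."""
        rows = [(s, self.calls[s], self.tokens[s] * self.price / 1_000_000)
                for s in self.tokens]
        return sorted(rows, key=lambda row: row[2], reverse=True)


ledger = CostLedger(price_per_1m_tokens=1.0)  # hypothetical price
ledger.record("classify", input_tokens=300, output_tokens=10)
ledger.record("answer", input_tokens=12_000, output_tokens=400)
ledger.record("answer", input_tokens=15_000, output_tokens=500)
for step, calls, cost in ledger.report():
    print(f"{step}: {calls} calls, ${cost:.4f}")
```

With a report like this, the question changes from "why is the bill high" to "why does the answer step carry this much context".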

Practical takeaway for business

Ask for a cost projection at real volume, not a figure from the prototype. If spend grows faster than the number of requests, context is silently growing somewhere; with observability built in, tracing finds it within hours.
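The "spend grows faster than requests" signal is a one-line check on cost per request. The sketch below uses hypothetical monthly figures and an arbitrary 20% tolerance; `cost_per_request_growth` is an illustrative helper, not part of any library.

```python
def cost_per_request_growth(periods: list[tuple[int, float]]) -> float:
    """periods: (requests, spend in USD) per period, oldest first.

    Returns the ratio of the latest cost per request to the earliest one.
    """
    first_requests, first_spend = periods[0]
    last_requests, last_spend = periods[-1]
    return (last_spend / last_requests) / (first_spend / first_requests)


# Hypothetical monthly figures: requests doubled, spend grew fivefold.
history = [(10_000, 200.0), (14_000, 420.0), (20_000, 1_000.0)]
growth = cost_per_request_growth(history)
if growth > 1.2:  # the 20% tolerance is an arbitrary illustrative threshold
    print(f"cost per request grew {growth:.1f}x: look for growing context or retries")
```

If the ratio stays near 1, the bill is growing with the load and the architecture is holding. If it climbs, per-step tracing shows which step is responsible.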

Design cost control into the architecture before launch. That means a few days of work up front instead of weeks of scrambling and feature shutdowns later, for the same final functionality.

Don’t treat model price decline as a strategy. It is a tailwind, not budget management; you manage the architecture.


Open questions

Where does the limit of saving without quality loss lie? That is a per-step trade-off, not a general rule. How can cost be forecast before the pilot? Only as a range. How should the falling token price and the growing volume be related in a long-term budget? The trends partly cancel each other, and there is no precise method.

If the automation bill grows faster than the load, the cause is the architecture, and tracing will find it.