AI Economics: why the token bill grows unnoticed
Where uncontrolled AI-system cost comes from in production, how to measure it per workflow step, and which decisions actually cut spend without losing quality.
A team demos an assistant that answers questions over a knowledge base. Everyone likes it, and it ships. Two months later the model-provider invoice arrives at 6–8× the forecast. Then come the panic, the prompt-tweaking, and the argument about whether to switch the feature off until someone figures it out. This is a common story, and the problem in it is an engineering one, not a financial one: cost was never part of the architecture.
Why the forecast is always low
The forecast is usually: average request length × price per token × number of requests. That formula misses the three things that actually consume the budget.
First, context. In the prototype, two or three relevant fragments go into the model. In production, to “not lose quality”, people start adding dialogue history, system instructions, examples and everything retrieval found. Real request length is not 1–2K tokens but 8–15K. Price scales linearly with token count, so the bill grows by the same factor.
Second, repeated calls. One answer to a user is rarely one model call. It is intent classification, query rewriting for search, the answer itself, sometimes a second model checking the answer, sometimes a retry on failure. Five model calls per visible answer is normal, not exceptional.
Third, retries and silent loops. A timeout, a failed parse, an agent stuck on a tool. Every such case is paid-for tokens with no result. In a system without observability they are invisible until the invoice.
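To make the gap concrete, here is a minimal arithmetic sketch that sets the naive formula against an estimate that counts every call and the retry overhead. Every number in it is an illustrative assumption: the price, the per-call token counts, the 10% retry overhead and the request volume are made up, and the step names are hypothetical.

```python
# All figures are illustrative assumptions, not measurements.
PRICE_PER_1K_TOKENS = 0.01  # hypothetical blended price in USD


def naive_forecast(avg_tokens: int, requests: int) -> float:
    """Average request length x price per token x number of requests."""
    return avg_tokens / 1000 * PRICE_PER_1K_TOKENS * requests


def realistic_estimate(tokens_by_call: dict, retry_overhead: float,
                       requests: int) -> float:
    """Sum every model call behind one visible answer, plus wasted retries."""
    tokens_per_answer = sum(tokens_by_call.values()) * (1 + retry_overhead)
    return tokens_per_answer / 1000 * PRICE_PER_1K_TOKENS * requests


# Hypothetical calls behind one visible answer, with assumed token counts.
CALLS = {
    "classify_intent": 300,
    "rewrite_query": 700,
    "answer": 8_000,       # the grown production context
    "check_answer": 1_000,
}

forecast = naive_forecast(avg_tokens=1_500, requests=100_000)
estimate = realistic_estimate(CALLS, retry_overhead=0.10, requests=100_000)
print(f"forecast ${forecast:,.0f}, estimate ${estimate:,.0f}, "
      f"ratio {estimate / forecast:.1f}x")
```

Under these assumed inputs the estimate lands at roughly seven times the forecast. The point is not the exact ratio but that each of the three factors multiplies the others.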
Cost is a step metric, not a system metric
The main mistake is reasoning about cost at the “system on average” level. The average tells you nothing: 5% of requests can drive 60% of spend. You have to measure per workflow step.
In practice: every model call is tagged with the step it belongs to, and for each step you collect cost, latency and error rate. Then you don’t see “the system is expensive” — you see “the query-rewrite step costs more than the answer because the entire dialogue history is being stuffed into it for some reason”. That is an engineering task with a clear fix, not a reason to cut the budget blindly.
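A minimal sketch of that tagging might look like the following. `StepTracker` and its fields are hypothetical names, and the sketch assumes each model call is wrapped in a function that returns the result together with the number of tokens it used; a real system would read usage from the provider response and export the numbers to its tracing backend.

```python
import time
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StepStats:
    calls: int = 0
    errors: int = 0
    tokens: int = 0
    cost_usd: float = 0.0
    latency_s: float = 0.0


class StepTracker:
    """Accumulates cost, latency and errors per workflow step (hypothetical helper)."""

    def __init__(self, price_per_1k_tokens: float):
        self.price = price_per_1k_tokens  # assumed flat price, for illustration
        self.stats = defaultdict(StepStats)

    def record(self, step, call, *args, **kwargs):
        """Run call(*args, **kwargs) and attribute it to `step`.

        Assumption: `call` returns a (result, tokens_used) tuple.
        """
        s = self.stats[step]
        s.calls += 1
        start = time.perf_counter()
        try:
            result, tokens = call(*args, **kwargs)
        except Exception:
            # Token usage of a failed call is unknown in this sketch.
            s.errors += 1
            raise
        finally:
            s.latency_s += time.perf_counter() - start
        s.tokens += tokens
        s.cost_usd += tokens / 1000 * self.price
        return result


def fake_model(prompt):
    # Stand-in for a real model client; counts words as "tokens".
    return "ok", len(prompt.split())


tracker = StepTracker(price_per_1k_tokens=0.01)
tracker.record("rewrite_query", fake_model, "short query plus the whole dialogue history")
tracker.record("answer", fake_model, "short query")
for step, s in tracker.stats.items():
    print(step, s.calls, s.tokens, round(s.cost_usd, 6), s.errors)
```

With the numbers grouped by step, a rewrite step that outspends the answer step shows up directly in the report.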
What actually reduces spend
Hybrid routing. Not every step needs a strong model. Classification, field extraction and short rewrites are handled fine by a cheap fast model. The strong, expensive one is needed where the cost of error is high. Model choice is a function of the step (risk, required quality, latency), not a global constant. On real workflows, moving routine steps to a cheap model removes 40–70% of spend with no visible quality loss, because those steps never needed the strong model’s quality.
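As a sketch, routing can be a small function of a per-step profile rather than a global setting. The model names, the `StepProfile` fields and the step list below are hypothetical; latency is left out to keep the example short, and a real policy would likely include it.

```python
from dataclasses import dataclass

CHEAP_MODEL = "cheap-fast-model"  # hypothetical model name
STRONG_MODEL = "strong-model"     # hypothetical model name


@dataclass(frozen=True)
class StepProfile:
    high_error_cost: bool       # is a wrong output expensive for the business?
    needs_strong_quality: bool  # does the step need the strong model's quality?


def choose_model(profile: StepProfile) -> str:
    """Pick the model from the step's profile, not from a global constant."""
    if profile.high_error_cost or profile.needs_strong_quality:
        return STRONG_MODEL
    return CHEAP_MODEL


# Assumed profiles for the routine steps named in the text plus the answer step.
STEPS = {
    "classify_intent": StepProfile(high_error_cost=False, needs_strong_quality=False),
    "extract_fields": StepProfile(high_error_cost=False, needs_strong_quality=False),
    "rewrite_query": StepProfile(high_error_cost=False, needs_strong_quality=False),
    "answer": StepProfile(high_error_cost=True, needs_strong_quality=True),
}

for name, profile in STEPS.items():
    print(name, "->", choose_model(profile))
```

Keeping the profiles in one table makes the routing decision reviewable: changing a step's model becomes a one-line diff with a stated reason.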
Context budgets. Each step gets a cap on how many context tokens it may use. This forces retrieval to return a re-ranked minimum, not “everything we found”. A context budget is not a quality limit, it is protection against silent request growth.
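One way to enforce such a cap is to fill the context greedily from the re-ranked fragments until the budget is spent. This is a sketch under assumptions: `fit_to_budget` is a hypothetical helper, fragments are assumed to arrive as (relevance score, text) pairs, and the whitespace word count only stands in for the model's real tokenizer.

```python
def count_tokens_approx(text: str) -> int:
    # Placeholder: whitespace words, NOT a real tokenizer.
    return len(text.split())


def fit_to_budget(fragments, budget_tokens, count_tokens=count_tokens_approx):
    """Keep the highest-scored fragments that fit into budget_tokens.

    fragments: iterable of (score, text), higher score = more relevant.
    Returns (kept_texts, tokens_used).
    """
    kept, used = [], 0
    for _score, text in sorted(fragments, key=lambda f: f[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > budget_tokens:
            continue  # too large for the remaining budget; a smaller one may still fit
        kept.append(text)
        used += cost
    return kept, used


retrieved = [
    (0.9, "refund policy applies within thirty days"),
    (0.5, "a long and only loosely related fragment about shipping and packaging rules"),
    (0.7, "refunds go to the original payment method"),
]
context, used = fit_to_budget(retrieved, budget_tokens=14)
print(used, context)
```

The budget is passed per call, so each step can have its own cap.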
Caching. System instructions, tool descriptions and stable context repeat from request to request. Provider-side prompt caching, or your own cache for idempotent requests, means you stop paying twice for the same thing. On conversational systems this is often the cheapest measure to implement and the one with the most visible effect.
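For the “own cache” option, a minimal in-memory sketch is shown below. `ResponseCache` is a hypothetical class, and it assumes the request is truly idempotent (same model, prompt and parameters should yield an interchangeable answer, for example deterministic extraction). Provider-side prompt caching works differently and is configured through each provider's own API, which is not shown here.

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache keyed by a hash of the full request (hypothetical helper)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str, params: dict) -> str:
        # sort_keys makes the key independent of parameter order.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, params, call):
        """Return the cached result, or run call(model, prompt, params) once."""
        key = self._key(model, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call(model, prompt, params)
        self._store[key] = result
        return result


def fake_call(model, prompt, params):
    # Stand-in for a paid model call.
    return f"{model}:{prompt.upper()}"


cache = ResponseCache()
first = cache.get_or_call("cheap-fast-model", "extract date", {"temperature": 0}, fake_call)
second = cache.get_or_call("cheap-fast-model", "extract date", {"temperature": 0}, fake_call)
print(first == second, cache.hits, cache.misses)
```

A production version would also need eviction and an expiry policy, since stale answers are a quality problem of their own.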
Cutting silent loops. Hard caps on agent iterations, timeouts, early exit on low confidence. A loop that doesn’t converge must stop and escalate to a human, not burn tokens until timeout.
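These cutoffs can be sketched as a loop wrapper with three exits. The names `run_agent` and `Escalate`, the default limits, and the shape of the step function (it returns a new state, a done flag and a confidence value) are all assumptions made for illustration.

```python
import time


class Escalate(Exception):
    """Raised when the loop must stop and hand the case to a human."""


def run_agent(step, max_iterations=5, deadline_s=30.0, min_confidence=0.3):
    """Run step(state) until it reports done, with hard cutoffs.

    Assumption: step(state) returns (new_state, done, confidence).
    """
    start = time.monotonic()
    state = None
    for i in range(max_iterations):
        if time.monotonic() - start > deadline_s:
            raise Escalate(f"deadline of {deadline_s}s exceeded at iteration {i}")
        state, done, confidence = step(state)
        if done:
            return state
        if confidence < min_confidence:
            raise Escalate(f"confidence {confidence:.2f} below threshold")
    raise Escalate(f"no convergence after {max_iterations} iterations")


def stuck_step(state):
    # Simulates an agent stuck on a tool: never done, always fairly confident.
    return (state or 0) + 1, False, 0.9


try:
    run_agent(stuck_step, max_iterations=3)
except Escalate as exc:
    print("escalated:", exc)
```

The deadline is checked between iterations only, so each individual model call still needs its own timeout.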
What to treat as normal
A few field reference points — not guarantees, but anchors:
- If one step costs an order of magnitude more than a comparable step in the same workflow, that is not “an expensive model”, it is a context bug. Check first what is being put into that step’s context.
- If 10% of requests drive over half the spend, that is not a reason to optimize everything, it is a reason to dissect those 10%.
- If spend grows faster than request count, context length is silently growing somewhere. Per-step tracing will show where.
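The second and third reference points can be checked from per-request cost logs with a few lines of arithmetic. This is a sketch with made-up numbers; `top_share` and `cost_per_request_growth` are hypothetical helpers, and the inputs are assumed to be per-request costs and per-period totals that you already collect.

```python
def top_share(costs, top_fraction=0.10):
    """Share of total spend driven by the most expensive top_fraction of requests."""
    if not costs:
        return 0.0
    ordered = sorted(costs, reverse=True)
    n = max(1, round(len(ordered) * top_fraction))
    total = sum(ordered)
    return sum(ordered[:n]) / total if total else 0.0


def cost_per_request_growth(spend_before, requests_before, spend_after, requests_after):
    """Ratio of average cost per request between two periods.

    A value above 1.0 means spend grew faster than request count.
    """
    return (spend_after / requests_after) / (spend_before / requests_before)


# Made-up sample: nine cheap requests and one very expensive one.
sample_costs = [1.0] * 9 + [11.0]
print(f"top 10% share: {top_share(sample_costs):.0%}")
print(f"cost per request changed by {cost_per_request_growth(1000, 10_000, 3000, 20_000):.2f}x")
```

In the made-up sample a single request accounts for more than half of the spend, which is the signal to dissect that request rather than optimize the other nine.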
Conclusion
Token economics is a platform layer, like latency and reliability. You design it before launch: routing by step cost, context budgets, caching, loop cutoffs and per-step spend tracing. Done up front it costs a few days of work. Done after the invoice it costs weeks of investigation and usually a temporary feature shutdown. The difference is not technology — it is whether cost was treated as an engineering quantity from the start.