engineering notes

Cost-aware architecture for AI systems

How to design AI systems where cost is an engineering metric alongside latency and reliability, not a surprise at month's end.

In brief for executives. A predictable AI bill is the result of an architecture designed before launch, not of optimization after the fact. The model’s price is collapsing, but that does not make the system cheap: cost is set by how often the model is called and with how much context. In a cost-aware architecture, cost is designed as a metric alongside latency and reliability. Done in advance, this takes a few days; done from the bill, it takes weeks of analysis and a temporary feature shutdown.

The story repeats: a system is launched, two months later a bill arrives many times above the forecast, and an “optimization” scramble begins. The problem is not financial but one of engineering: cost was not part of the architecture.

A predictable AI bill is architecture, not luck.

Hypothesis: cost is designed, not optimized later

An AI system’s cost is not a consequence of the model’s price but of the architect’s decisions: how many calls per result, what context in each, which model on which step. These decisions are made at design time. Optimizing “by the bill” is more expensive and almost always means a temporary shutdown of part of the functionality.

data

Inference price at GPT-3.5 level (per 1M tokens)

November 2022: $20.00
October 2024: $0.07

×280 price drop in just under two years (November 2022 to October 2024)

The model's own price is collapsing — so system cost is set not by it, but by the architecture around it: context length, number of calls, routing.

Source: Stanford HAI, AI Index Report 2025 https://hai.stanford.edu/ai-index/2025-ai-index-report

The per-token price fell by a factor of hundreds, and bills still blow up. So the cause is not the model’s price but the architecture around it.

Problem: cost is counted after the fact

The budget is planned from the prototype: average request length × price × number of requests. This formula lacks what eats the budget in production: growing context, several model calls per visible result, retries and silent loops. So production turns out many times more expensive than the prototype, and this is discovered only from the bill.
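To make the gap concrete, here is a minimal sketch that compares the prototype formula with one that also counts the factors named above. Every name and number in it (the `Workload` fields, the multipliers, the price) is hypothetical and chosen only for illustration; real values come from measurement.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    requests_per_month: int
    avg_tokens_per_call: int     # measured on the prototype
    price_per_1m_tokens: float   # USD, hypothetical


def prototype_estimate(w: Workload) -> float:
    """Average request length x price x number of requests."""
    return w.requests_per_month * w.avg_tokens_per_call * w.price_per_1m_tokens / 1_000_000


def production_estimate(w: Workload, context_growth: float,
                        calls_per_result: float, retry_rate: float) -> float:
    """Same formula, scaled by the factors the prototype does not see."""
    tokens_per_call = w.avg_tokens_per_call * context_growth
    calls = w.requests_per_month * calls_per_result * (1 + retry_rate)
    return calls * tokens_per_call * w.price_per_1m_tokens / 1_000_000


w = Workload(requests_per_month=100_000, avg_tokens_per_call=2_000, price_per_1m_tokens=1.0)
naive = prototype_estimate(w)    # 200.0
real = production_estimate(w, context_growth=3.0, calls_per_result=4.0, retry_rate=0.25)  # 3000.0
print(f"prototype: ${naive:,.2f}/month, production: ${real:,.2f}/month")
```

With these illustrative inputs, three modest-looking multipliers compound into a 15-fold difference, which is why each of them deserves its own line in the estimate.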

Why the usual approaches don’t work

“Wait, models get cheaper” doesn’t work: the per-token price falls, but the architectural habit of stuffing everything into context does not go away, and volume grows faster than price falls.

data

Why “it will get cheaper by itself” is not a cost strategy

−30%/yr: decline in inference hardware cost
+40%/yr: gain in energy efficiency

Infrastructure gets cheaper on its own, but an AI agent's cost of ownership is set by what the engineer designs in: per-step model routing, context budgets, breaking silent loops.

Source: Stanford HAI, AI Index Report 2025 https://hai.stanford.edu/ai-index/2025-ai-index-report

“Optimize later if it gets expensive” doesn’t work: by the time it is expensive, the architecture is already fixed, and the only fast lever is to turn off features.

“Take a cheaper model everywhere” doesn’t work: on steps with a high error cost a weak model creates losses bigger than the saving.

Engineering model: cost as a design metric

Per-step model routing. The choice of model is a function of the step (risk, required quality, latency). Routine steps run on a cheap, fast model; expensive decisions run on a strong one. This removes the main share of spend, and the savings come from steps where top quality was never required.
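A minimal sketch of such a router, assuming two model tiers and three step properties. The tier names, the `Step` fields and the latency threshold are hypothetical; a real router would use whatever properties the team actually measures.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = 1
    HIGH = 2


@dataclass(frozen=True)
class Step:
    name: str
    risk: Risk
    needs_high_quality: bool
    max_latency_ms: int


# Hypothetical tiers; real model names and prices depend on the provider.
CHEAP_FAST = "small-model"
STRONG = "large-model"
STRONG_MIN_LATENCY_MS = 2_000  # assumed: the strong model needs at least this budget


def route(step: Step) -> str:
    """Pick a model tier from the step's properties, not from a global default."""
    if step.risk is Risk.HIGH:
        return STRONG  # a costly error outweighs the token saving
    if step.needs_high_quality and step.max_latency_ms >= STRONG_MIN_LATENCY_MS:
        return STRONG
    return CHEAP_FAST


classify = Step("classify_ticket", Risk.LOW, needs_high_quality=False, max_latency_ms=500)
approve = Step("approve_refund", Risk.HIGH, needs_high_quality=True, max_latency_ms=5_000)
print(route(classify), route(approve))
```

Keeping the rule in one function makes the routing decision reviewable and testable like any other piece of design.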

A context budget per step. A hard token ceiling per step stops context length from silently growing. This serves both cost and quality.
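One way to enforce such a ceiling is to trim the context before every call. The sketch below assumes the chunks arrive sorted by relevance and uses a crude word count in place of a real tokenizer; the step names and budgets are hypothetical.

```python
from typing import Callable

# Hypothetical per-step ceilings, in tokens.
STEP_TOKEN_BUDGET = {"classify": 500, "summarize": 4_000}


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_budget(step: str, chunks: list[str],
                  count: Callable[[str], int] = rough_token_count) -> list[str]:
    """Keep chunks in the given order until the step's ceiling would be exceeded."""
    budget = STEP_TOKEN_BUDGET[step]
    kept: list[str] = []
    used = 0
    for chunk in chunks:  # assumed sorted, most relevant first
        cost = count(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept


chunks = ["a " * 300, "b " * 300, "c " * 100]
print(len(fit_to_budget("classify", chunks)), len(fit_to_budget("summarize", chunks)))
```

Passing the counter as a parameter lets the same function work with the provider's actual tokenizer once it is available.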

Steps without a model. Logic, branching, and data work are done by ordinary code; the model is called only where truly needed. The cheapest token is the one never sent.
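As an illustration, the sketch below extracts an order ID with a regular expression and falls back to a model only when the pattern does not match. The ID format and the `ask_model` callable are hypothetical placeholders for whatever the system really uses.

```python
import re
from typing import Callable, Optional

ORDER_ID = re.compile(r"\b[A-Z]{2}-\d{6}\b")  # hypothetical ID format


def extract_order_id(message: str,
                     ask_model: Callable[[str], Optional[str]]) -> tuple[Optional[str], bool]:
    """Return (order_id, model_was_called). Code first, model only as a fallback."""
    match = ORDER_ID.search(message)
    if match:
        return match.group(0), False
    return ask_model(message), True


model_calls: list[str] = []


def fake_model(message: str) -> Optional[str]:
    """Stand-in for a real model call; it only records that it was invoked."""
    model_calls.append(message)
    return None


print(extract_order_id("Where is AB-123456?", fake_model))
print(extract_order_id("where is my parcel", fake_model))
```

Returning the `model_was_called` flag makes it easy to measure what share of traffic the plain-code path already handles.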

Breaking silent loops. Iteration limits, timeouts, early exit. A non-converging cycle stops and escalates rather than burning tokens.
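The three guards can be combined in one small wrapper. This is a sketch under assumptions: `step` stands for one model-backed iteration, `is_done` for the convergence check, and `Escalation` for whatever hand-off mechanism the system has; all three names are hypothetical.

```python
import time
from typing import Callable


class Escalation(Exception):
    """Raised when a loop is stopped without converging."""


def run_bounded(step: Callable[[str], str], is_done: Callable[[str], bool], state: str,
                max_iterations: int = 5, timeout_s: float = 30.0) -> str:
    """Iterate until done, but never past the iteration limit or the deadline."""
    deadline = time.monotonic() + timeout_s
    for _ in range(max_iterations):
        if is_done(state):
            return state  # early exit: no further calls are paid for
        if time.monotonic() > deadline:
            raise Escalation("timeout")
        state = step(state)
    if is_done(state):
        return state
    raise Escalation("iteration limit reached")


print(run_bounded(lambda s: s + "x", lambda s: len(s) >= 3, ""))
```

Raising instead of returning a partial result forces the caller to decide what escalation means, for example a human review queue.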

Cost observability by step. Every call is tagged with a step; cost is collected by step. “The system is expensive” turns into “step X is expensive because of Y” — a task with a solution.
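A minimal sketch of per-step cost accounting, assuming every call is logged with its step tag and token counts. The record fields, model names and prices are hypothetical; in practice the records would come from the tracing system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    step: str
    model: str
    input_tokens: int
    output_tokens: int


# Hypothetical USD prices per 1M tokens: (input, output).
PRICES = {"small-model": (0.10, 0.40), "large-model": (3.00, 15.00)}


def call_cost(r: CallRecord) -> float:
    """Cost of a single call in USD."""
    price_in, price_out = PRICES[r.model]
    return (r.input_tokens * price_in + r.output_tokens * price_out) / 1_000_000


def cost_by_step(records: list[CallRecord]) -> dict[str, float]:
    """Total cost per step, most expensive step first."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r.step] += call_cost(r)
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


records = [
    CallRecord("classify", "small-model", 1_000_000, 0),
    CallRecord("decide", "large-model", 1_000_000, 0),
    CallRecord("classify", "small-model", 1_000_000, 0),
]
print(cost_by_step(records))
```

Sorting by total puts the most expensive step at the top of the report, which is where the investigation starts.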

Practical takeaway for business

Require two figures: development cost and monthly run cost at your volume. A project without the second figure has no cost-of-ownership estimate.

Ask about cost observability before launch. If cost cannot be decomposed by step, it cannot be managed; the only option left is to cut features when the bill arrives.

Design cost control into the architecture, not into an “optimizations” roadmap. That is a few days of work in advance versus weeks of a scramble later — for the same final functionality.


Open questions

Where the limit of saving without quality loss lies is a trade-off resolved by per-step measurement, not by a general rule. How to balance cost and latency when a cheap model is slower on the needed step remains an open engineering question. Cost before the pilot can be forecast only as a range; the precise figure appears on real traffic.
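A range forecast can be as simple as evaluating the cost formula at the low and high end of each input. The sketch below assumes cost grows with every input, so the extremes give the bounds; the function names and all figures are hypothetical.

```python
def monthly_cost(requests: float, tokens_per_call: float,
                 calls_per_result: float, price_per_1m: float) -> float:
    """Monthly cost in USD for one set of assumptions."""
    return requests * calls_per_result * tokens_per_call * price_per_1m / 1_000_000


def forecast_range(requests: tuple[float, float], tokens_per_call: tuple[float, float],
                   calls_per_result: tuple[float, float],
                   price_per_1m: tuple[float, float]) -> tuple[float, float]:
    """Each argument is a (low, high) pair; returns the (low, high) monthly cost."""
    low = monthly_cost(requests[0], tokens_per_call[0], calls_per_result[0], price_per_1m[0])
    high = monthly_cost(requests[1], tokens_per_call[1], calls_per_result[1], price_per_1m[1])
    return low, high


low, high = forecast_range(
    requests=(50_000, 100_000),
    tokens_per_call=(2_000, 6_000),
    calls_per_result=(1, 4),
    price_per_1m=(1.0, 1.0),
)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

A wide range is itself useful information: it shows which inputs the pilot needs to measure first.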

If the AI bill grows faster than the load, the problem is in the architecture, and it is found via tracing: decompose cost by step and find where it is designed wrong.