engineering notes
Hidden hardcode in AI automation
How hard-wired rules and prompt chains turn AI automation into technical debt, and why that drives up the cost of change.
In brief, for executives. Impressive AI automation often rests on fragile hardcode: hard-wired rules and long prompt chains tuned to specific examples. That is a future cost of change, and it shows up in an audit, not in a demo. For the business one thing matters: a system that is expensive to change stops delivering value exactly when the process changes, which is to say always.
“Look, the AI does it all” in a demo often means “we wired all the needed cases into prompts and rules”. It works on the shown examples and breaks on new ones. Call it hidden hardcode: the automation looks intelligent but rests on rigidly fitted logic.
Hypothesis: behind “AI magic” there is often fitting to examples
A demo is prepared on known cases. The temptation is to get a perfect result on them by wiring in special-case rules and lengthening the prompts for each example shown. The result is a system optimized for the demo rather than for the real flow of work: it does not generalize, it reproduces what it was fitted to.
Problem: fragility and the cost of changes
Hidden hardcode does not show immediately. The first weeks are fine: the cases from the demo work. Then the process changes (a new request type, a different document format) and the system starts making errors where it “always worked”. Each change then requires editing hard-wired logic rather than adjusting configuration, and the cost of ownership grows faster than the value.
At scale this is already visible in the data: 2024 is the first year in which more duplication than refactoring was introduced. Generation speed without architecture shows up as a falling share of refactoring and rising duplication, that is, as technical debt: “hidden hardcode” at scale.
Why the usual approaches don’t work
“Add another rule for this case” does not scale: the number of special-case rules grows faster than coverage, and the interactions between rules become impossible to debug.
“Lengthen the prompt, add examples” turns the prompt into an unreadable configuration with no versions and no tests: the same hardcode, only in text.
“Use a stronger model” does not cure it: a stronger model on top of fragile hard-wired logic produces the same failures, only better phrased.
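The rule-interaction problem can be shown with a toy example. This is a minimal sketch, assuming a first-match rule list; the rule contents, route names and the sample request are all hypothetical and are not taken from any real system. A rule added for one new case silently changes the result for a case that used to work.

```python
from typing import Callable, List, Tuple

# Hypothetical first-match rule list: each rule is (predicate, route).
Rule = Tuple[Callable[[str], bool], str]


def route(text: str, rules: List[Rule]) -> str:
    """Return the route of the first rule whose predicate matches."""
    lowered = text.lower()
    for predicate, target in rules:
        if predicate(lowered):
            return target
    return "manual_review"


# Rules fitted to the cases shown in the demo.
rules_v1: List[Rule] = [
    (lambda t: "invoice" in t, "accounting"),
    (lambda t: "contract" in t, "legal"),
]

# "Add another rule for this case": anything mentioning a dispute goes to legal.
rules_v2: List[Rule] = [(lambda t: "dispute" in t, "legal")] + rules_v1

old_case = "Invoice 17: customer disputes the delivery date"
print(route(old_case, rules_v1))  # accounting
print(route(old_case, rules_v2))  # legal: the new rule shadows the old one
```

With two or three rules the shadowing is easy to spot. With dozens of rules, every new rule can interact with every existing one, which is why a case set that is re-run on each change matters more than the rules themselves.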
Engineering model: contracts, configuration, testability
Contracts instead of fitting. Behaviour is defined by typed inputs and outputs with explicit handling of “unsure”, not by a pile of special-case “ifs” in the prompt.
Configuration instead of hard-wiring. Whatever changes (routing rules, thresholds, formats) lives in versioned configuration, not inside the prompt or the code.
Testability. There is a set of cases (including exceptions) the system is run against on every change. A regression is visible before production, not by a complaint.
Separation of model and logic. Logic, branching and data handling are engineering layers; the model is called where it is needed rather than “deciding everything by prompt”. This also lowers token cost.
Decision observability. It is visible why the system made a decision — otherwise hardcode cannot even be found.
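The five principles above can be sketched together in a few dozen lines. This is an illustration under stated assumptions, not a reference implementation: the names `Request`, `Decision`, `decide`, `CONFIG_V3` and `stub_classifier`, the configuration keys and the threshold value are all hypothetical. The model is reduced to a function that returns a label and a confidence; everything else is ordinary code.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Literal, Tuple

# Versioned configuration. In practice this would be a file under version
# control; the keys and values here are hypothetical.
CONFIG_V3 = json.loads("""
{
  "version": "3",
  "min_confidence": 0.8,
  "routes": {"invoice": "accounting", "contract": "legal"}
}
""")


@dataclass(frozen=True)
class Request:
    """Typed input contract."""
    request_id: str
    text: str


@dataclass(frozen=True)
class Decision:
    """Typed output contract with an explicit 'unsure' status."""
    request_id: str
    status: Literal["routed", "unsure"]
    route: str
    label: str
    confidence: float
    config_version: str
    reason: str  # recorded so the decision can be explained later


# The model is only asked for (label, confidence); it does not branch or route.
Classifier = Callable[[str], Tuple[str, float]]


def decide(req: Request, classify: Classifier, config: dict) -> Decision:
    label, confidence = classify(req.text)
    route = config["routes"].get(label)
    if route is None:
        status, route = "unsure", "human_review"
        reason = f"no route configured for label '{label}'"
    elif confidence < config["min_confidence"]:
        status, route = "unsure", "human_review"
        reason = f"confidence {confidence:.2f} below threshold"
    else:
        status = "routed"
        reason = f"label '{label}' mapped by config"
    return Decision(req.request_id, status, route, label, confidence,
                    config["version"], reason)


def stub_classifier(text: str) -> Tuple[str, float]:
    """Stand-in for a model call, so the sketch runs without a model."""
    if "invoice" in text.lower():
        return "invoice", 0.93
    return "unknown", 0.40


# Regression cases, including an exception, run on every change.
CASES = [
    (Request("r1", "Invoice 17 attached"), "routed", "accounting"),
    (Request("r2", "Please call me back"), "unsure", "human_review"),
]

for req, expected_status, expected_route in CASES:
    decision = decide(req, stub_classifier, CONFIG_V3)
    assert (decision.status, decision.route) == (expected_status, expected_route)
    print(json.dumps(asdict(decision)))  # one decision record per request
```

In this sketch, changing a route or a threshold is an edit to the configuration followed by a re-run of `CASES`, and the `reason` and `config_version` fields make it possible to see afterwards why a given request went where it did.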
Practical takeaway for business
Fragility is a future bill, and it is visible in an audit. Before scaling, ask three things: what happens on a new input type that was not in the demo; which rules live in configuration and which are hard-wired; whether there are tests for the exceptions. Evasive answers are themselves the sign of hidden hardcode.
The systemic, predictable failure of most AI initiatives comes down largely to this: the problem is not the model but the absence of architecture (state, contracts, failure handling, handoff to humans), with fitting to the demo in place of a design you can change.
Assess the cost of change, not just the cost of launch. A system that is expensive to change will not survive the first process change, and processes always change.
Open questions
Where reasonable configuration ends and excessive flexibility begins is a trade-off that has to be resolved per process. How many test cases are enough depends on the cost of an error. How to quickly detect “fitting to examples” when auditing someone else’s system is still open: we look at behaviour outside the cases shown, but there is no mature standard.
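One way to make “behaviour outside the cases shown” concrete is to compare accuracy on the demo cases with accuracy on cases the vendor did not prepare. The sketch below is one possible probe, not a standard: the function names, the toy `fitted_system` and all sample cases are hypothetical, and what size of gap counts as alarming is left open, as above.

```python
from typing import Callable, Iterable, Tuple

Case = Tuple[str, str]  # (input text, expected label)


def accuracy(system: Callable[[str], str], cases: Iterable[Case]) -> float:
    """Share of cases where the system returns the expected label."""
    case_list = list(cases)
    if not case_list:
        raise ValueError("need at least one case")
    correct = sum(1 for text, expected in case_list if system(text) == expected)
    return correct / len(case_list)


def generalization_gap(system: Callable[[str], str],
                       shown: Iterable[Case],
                       unseen: Iterable[Case]) -> float:
    """Accuracy on demo cases minus accuracy on cases outside the demo."""
    return accuracy(system, shown) - accuracy(system, unseen)


# Toy system fitted to the demo: it memorizes the shown inputs and falls
# back to a fixed answer for everything else.
DEMO_ANSWERS = {
    "Invoice 17 attached": "invoice",
    "Contract draft v2": "contract",
}


def fitted_system(text: str) -> str:
    return DEMO_ANSWERS.get(text, "invoice")


shown_cases = list(DEMO_ANSWERS.items())
unseen_cases = [
    ("Invoice no. 18, see PDF", "invoice"),
    ("Signed contract enclosed", "contract"),
    ("Lease agreement renewal", "contract"),
    ("Bill for March", "invoice"),
]

gap = generalization_gap(fitted_system, shown_cases, unseen_cases)
print(f"shown={accuracy(fitted_system, shown_cases):.2f} "
      f"unseen={accuracy(fitted_system, unseen_cases):.2f} gap={gap:.2f}")
# shown=1.00 unseen=0.50 gap=0.50
```

A large gap does not prove hardcode, and a small one does not rule it out; the probe is only as good as the unseen cases, which should be chosen by whoever runs the audit rather than by whoever built the demo.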
If you were shown an impressive demo, it is worth checking behaviour on what was not in the demo: we look at where the system generalizes and where it is fitted.