engineering notes

The bot that passes every test and doesn't sell

Why an AI bot with 95% green tests barely moves sales: tests check knowledge of facts, not handling doubt, holding the dialog, and bringing the customer back.

Executive summary. You can end up with an AI bot that passes almost every test and barely moves revenue. It’s not a paradox: tests usually check whether the bot knows the facts, while the sale is made by entirely different abilities, namely handling doubt, holding a slipping dialog, and bringing the customer back to the point. If those abilities aren’t measured, the report will be green and the till unchanged. The good news: the gap is visible in advance and can be closed before launch.

A common picture: the bot is delivered, tests green, the team pleased — and a quarter later it turns out it barely affected sales. On inspection, the wrong thing had been checked. The bot knew routes, prices and specs perfectly, but fell apart exactly where the selling begins.

You can have 95% green tests and almost no impact on sales — if the tests checked the wrong thing.

Hypothesis: tests measured knowledge, but the sale is made by behavior

A sale in dialog isn’t delivering the right fact. It’s working with a hesitant person: recognizing wavering, holding attention, bringing back someone who got distracted, gently moving them to the next step. Tests, however, almost always check factual correctness: did the bot answer the question right. The two are weakly related. A bot can be flawless on facts and helpless at selling — and a green report won’t show it, because it measures the wrong ability.

Problem: a green report hides a commercial failure

The divergence is dangerous because it looks like success. Management sees a high pass rate and considers the job done. Budget spent, project “delivered”, business metric unmoved. Then the worst part kicks in: instead of concluding “we measured the wrong thing”, the company concludes “AI didn’t work for us”. One mis-evaluated bot sours the attitude toward the whole direction — and the next, correctly built project has to be defended against that residue.

data

Expected vs realized ROI of agentic AI

171%: average expected ROI of agentic AI in org surveys

<1%: executives reporting significant ROI (≥20% to profit or savings)

$1.41: average return per $1 invested (savings + revenue growth)

Expectations run far above realized impact. An honest ROI is computed for a specific process against its full cost of ownership, not taken from a 171% expectation.

Source: Deloitte, AI ROI, 2025 https://www.deloitte.com/global/en/issues/generative-ai/ai-roi-the-paradox-of-rising-investment-and-elusive-returns.html

The gap between expected and realized ROI of agentic AI is exactly this: the effect is counted by expectation, not by what the system actually does to the metric. Green tests fit neatly into that gap — they confirm activity, not result.
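To make the arithmetic behind that gap concrete, here is a minimal sketch. It assumes the $1.41 figure is a gross return per dollar (so the net gain is $0.41); if the survey meant net return, the numbers would differ. The function name `roi` and its parameters are illustrative, not taken from the source.

```python
def roi(total_benefit: float, total_cost: float) -> float:
    """Return ROI as a fraction: net gain divided by full cost of ownership."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (total_benefit - total_cost) / total_cost


# Assumption: $1.41 is the gross return on each $1 invested.
realized = roi(total_benefit=1.41, total_cost=1.00)
expected = 1.71  # the 171% average expectation from the survey

print(f"realized ROI: {realized:.0%}")        # 41%
print(f"expected ROI: {expected:.0%}")        # 171%
print(f"gap: {expected - realized:.0%} points")
```

The same function applies to a single bot: total_cost should include build, integration, and running costs, and total_benefit only what can be attributed to the bot.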

Why the usual approaches don’t work

“Add more factual tests” doesn’t help: it refines an ability that’s already fine and says nothing about selling. “Ask the bot how it would sell” is useless: it will describe an ideal scenario that won’t exist in a live dialog. “Look at average dialog metrics” (length, message count, sentiment) also misses: these are tidy numbers that track no individual customer and don’t tell you whether a specific person reached a purchase. The root cause is the same in all three: people measure what’s easy to measure, not what makes money.

Engineering model: test the ability to sell, not to know

For a test to reflect selling, it must reproduce the difficulty of selling. In practice that means checking three things. First, handling doubt: the customer simulator objects, wavers, compares, leaves “to think it over”, and we watch whether the bot holds the dialog or gives up. Second, recovery after going off-topic: the customer gets distracted by a side question, and we check whether the bot brings them back to the decision. Third, goal persistence: after twenty minutes of chaos, does the bot still remember what the conversation was for? The scenarios come from real sales and support dialogs, not synthetic ones, because live customers hesitate in ways a developer wouldn’t invent. This tests the ability to reach a result, not knowledge of a handbook.
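The three checks can be sketched as a small scenario harness. This is an illustration under stated assumptions, not the harness from any real project: the names `Scenario`, `run_scenario`, `pass_rate_by_ability`, and `mentions_next_step` are hypothetical, the customer turns are placeholders where real dialogs would go, and the keyword judge is a crude stand-in for a proper evaluator.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed bot interface: takes the dialog history, returns the next reply.
Bot = Callable[[list[str]], str]


@dataclass
class Scenario:
    name: str
    ability: str                          # "doubt", "recovery" or "persistence"
    customer_turns: list[str]             # placeholder turns; use real dialogs
    passed: Callable[[list[str]], bool]   # judge over the bot's replies


def run_scenario(bot: Bot, scenario: Scenario) -> bool:
    """Play the scripted customer turns against the bot and judge the replies."""
    history: list[str] = []
    replies: list[str] = []
    for turn in scenario.customer_turns:
        history.append(f"customer: {turn}")
        reply = bot(history)
        history.append(f"bot: {reply}")
        replies.append(reply)
    return scenario.passed(replies)


def pass_rate_by_ability(bot: Bot, scenarios: list[Scenario]) -> dict[str, float]:
    """Report pass rate per selling ability instead of one overall score."""
    totals: dict[str, int] = {}
    passes: dict[str, int] = {}
    for s in scenarios:
        totals[s.ability] = totals.get(s.ability, 0) + 1
        passes[s.ability] = passes.get(s.ability, 0) + int(run_scenario(bot, s))
    return {ability: passes[ability] / totals[ability] for ability in totals}


def mentions_next_step(replies: list[str]) -> bool:
    """Crude stand-in judge: does the final reply steer toward a next step?"""
    return any(word in replies[-1].lower() for word in ("book", "reserve", "hold"))


SCENARIOS = [
    Scenario("leaves to think it over", "doubt",
             ["How much is the tour?", "Hmm, I need to think about it."],
             mentions_next_step),
    Scenario("side question about weather", "recovery",
             ["I'm looking at the tour.", "By the way, what's the weather there?"],
             mentions_next_step),
    Scenario("long detour", "persistence",
             ["I want to book the tour."] + ["Unrelated question?"] * 5,
             mentions_next_step),
]


def facts_only_bot(history: list[str]) -> str:
    """Toy bot that answers facts correctly and never moves the sale forward."""
    last = history[-1].lower()
    if "how much" in last:
        return "The tour costs 500."
    if "weather" in last:
        return "It is usually sunny."
    return "Okay."


print(pass_rate_by_ability(facts_only_bot, SCENARIOS))
```

The toy bot answers every factual question correctly and still fails all three abilities, which is the divergence a knowledge-only test suite cannot show.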

data

Almost everyone has adopted — few capture value

Adoption is near-universal, but measurable business impact is rare. The gap is not access to AI; it is whether AI was carried through into a managed process.

Source: McKinsey, The State of AI 2025 https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai

The same logic at market scale: almost everyone has deployed generative AI, but only a few extract measurable value. The difference between those groups isn’t access to models — it’s whether the bot was taken all the way to affecting the outcome, not just answering.

Practical takeaway for the business

Measure the bot by outcome, not by facts. The key acceptance question isn’t “how many tests passed” but “how much did it move conversion in a dedicated segment”. If there’s no answer, the bot isn’t ready, however green the report.
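One way to answer that acceptance question is to compare conversion in the dedicated bot segment against a control segment. The sketch below uses a standard two-proportion z-test; the function name `conversion_lift` and the counts in the example are illustrative assumptions, not data from the article.

```python
import math


def conversion_lift(control_n: int, control_sales: int,
                    bot_n: int, bot_sales: int) -> dict[str, float]:
    """Compare conversion between a control segment and a bot segment."""
    p_control = control_sales / control_n
    p_bot = bot_sales / bot_n
    # Pooled conversion rate under the hypothesis of no difference.
    pooled = (control_sales + bot_sales) / (control_n + bot_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / bot_n))
    z = (p_bot - p_control) / se if se > 0 else 0.0
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "control": p_control,
        "bot": p_bot,
        "lift": p_bot - p_control,
        "z": z,
        "p_value": p_value,
    }


# Made-up counts for illustration: 2000 customers per segment.
result = conversion_lift(control_n=2000, control_sales=100,
                         bot_n=2000, bot_sales=130)
print(f"lift: {result['lift']:.1%}, p-value: {result['p_value']:.3f}")
```

The comparison is only as honest as the segment split: customers should be assigned to the two segments the same way, over the same period, so that overall growth does not leak into the bot's number.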

Where this already worked. In one project for a large travel retailer (under NDA), the first version of the dialog assistant passed knowledge checks almost perfectly and didn’t affect sales. After rewriting the tests around handling doubt and holding the dialog, on a corpus of real inquiries, we found exactly where the bot “let go” of the customer and closed those gaps. What shifted wasn’t the test score but the conversation itself.

What not to do. Don’t accept a bot on factual tests if the job is to sell. Don’t judge quality by average dialog metrics. And don’t conclude “AI didn’t work” from a bot measured with the wrong ruler: the problem is almost always the ruler.

Open questions

How do you cleanly separate the bot’s contribution to a sale from everything else? You need a dedicated segment and an honest comparison, not overall growth. Where is the line between “the bot sells” and “the bot gets in the way”? It lies where the bot starts pushing against a hesitation that deserved respect, and that is settled not by the model but by a handoff-to-human rule. Who is accountable for the bot’s commercial result? The owner of the sales process, not the development team, and that role is worth assigning before launch.

If your bot passes tests but doesn’t move sales, you almost certainly measured knowledge, not selling. The next step is to find where the dialog loses the customer, and how to test for it.