Voice & telephony
Inbound Voice AI Agent
An inbound voice AI agent takes calls: it transcribes speech, answers from the knowledge base, files a request and transfers to an operator by an explicit rule, all within a latency budget and with barge-in support.
A voice line is the most demanding channel for an AI agent: the caller hears every delay and every unnatural pause. The inbound voice agent takes the call, understands the request, answers from the knowledge base and files a request, while routing hard calls to an operator by an explicit rule. All of that only works if it stays within the latency budget.
What it does
It answers an inbound call, transcribes speech as a stream and detects intent. It retrieves the answer from your knowledge base and voices a short reply in a natural voice. If an action is needed, it files a request through a CRM or telephony integration. If the request is out of scope or the caller asks for a human, it transfers the call to an operator with ready context. Every step is testable and replaceable, not a single “phone-into-model” black box.
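As a rough illustration of "testable and replaceable", the sketch below wires three stages together as plain callables, so each one can be swapped for a stub in a test. Every name in it (`VoicePipeline`, `Turn`, the stub functions and the intent labels) is hypothetical and not taken from a real product API. Speech recognition and synthesis are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Turn:
    """Result of handling one caller utterance (hypothetical structure)."""
    transcript: str
    intent: str
    reply: str
    action: Optional[str] = None  # e.g. "file_request" or "handoff"


@dataclass
class VoicePipeline:
    """Each stage is an injected callable, so it can be replaced or stubbed."""
    detect_intent: Callable[[str], str]
    retrieve_answer: Callable[[str, str], str]
    decide_action: Callable[[str], Optional[str]]

    def handle(self, transcript: str) -> Turn:
        intent = self.detect_intent(transcript)
        reply = self.retrieve_answer(intent, transcript)
        return Turn(transcript, intent, reply, self.decide_action(intent))


# Stub stages standing in for the real model, knowledge base and rules.
def fake_intent(text: str) -> str:
    return "opening_hours" if "open" in text.lower() else "unknown"


def fake_answer(intent: str, text: str) -> str:
    answers = {"opening_hours": "We are open 9 to 18."}
    return answers.get(intent, "Let me connect you to an operator.")


def fake_action(intent: str) -> Optional[str]:
    return "handoff" if intent == "unknown" else None


pipeline = VoicePipeline(fake_intent, fake_answer, fake_action)
turn = pipeline.handle("When are you open?")
print(turn.intent, "|", turn.reply, "|", turn.action)
```

Because the stages are passed in, a test can replace only the retrieval step and assert on the reply without touching telephony at all.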
Where the line is
In voice the line runs along two axes. The first is the latency budget: the round-trip from “the person finished” to “the agent started answering” must fit inside a natural pause, or the conversation falls apart. That’s why streaming transcription, early synthesis start and barge-in (the caller can interrupt the agent mid-reply) are structural rather than decorative. The second is the human-handoff point: two misunderstandings in a row, a money matter, an upset customer or a direct request for a human each trigger a transfer rather than an attempt to “push through” by voice.
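The handoff triggers listed above can be written down as an explicit, testable rule. The following is a minimal sketch, assuming the upstream steps already produce an intent label, a misunderstanding counter and an "upset" flag. The field names, the intent labels in `MONEY_INTENTS` and the reason strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical intent labels that count as "a money matter".
MONEY_INTENTS = {"refund", "payment", "billing"}


@dataclass
class CallState:
    consecutive_misunderstandings: int = 0
    intent: str = ""
    caller_upset: bool = False
    asked_for_human: bool = False


def handoff_reason(state: CallState) -> Optional[str]:
    """Return why the call should go to an operator, or None to continue."""
    if state.asked_for_human:
        return "direct_request"
    if state.consecutive_misunderstandings >= 2:
        return "repeated_misunderstanding"
    if state.intent in MONEY_INTENTS:
        return "money_matter"
    if state.caller_upset:
        return "upset_customer"
    return None


print(handoff_reason(CallState(intent="refund")))
print(handoff_reason(CallState(consecutive_misunderstandings=1)))
```

Returning a reason rather than a bare boolean lets the transfer carry context to the operator, and keeps the rule itself free of any model call.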
The telephony engineering is covered in more detail on the voice AI bots page. Answer retrieval inside the agent uses the same RAG system as in text channels, only under a hard time limit per reply.
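One way to picture a hard time limit per reply is a timeout around the retrieval call, with a short spoken filler as the fallback. The sketch below uses Python's `asyncio.wait_for`. The `slow_retrieval` coroutine is a stand-in for a real RAG call, and the 0.2 second budget and the filler phrase are illustrative assumptions, not recommended values.

```python
import asyncio

REPLY_BUDGET_S = 0.2  # illustrative only; the real budget depends on the stack
FILLER = "One moment, let me check that."


async def slow_retrieval(query: str, delay: float) -> str:
    """Stand-in for a knowledge-base lookup that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return f"answer for: {query}"


async def answer_within_budget(
    query: str, delay: float, budget_s: float = REPLY_BUDGET_S
) -> str:
    """Return the retrieved answer, or a filler if the budget is exceeded."""
    try:
        return await asyncio.wait_for(slow_retrieval(query, delay), timeout=budget_s)
    except asyncio.TimeoutError:
        # Retrieval was cancelled by wait_for; say something instead of silence.
        return FILLER


fast = asyncio.run(answer_within_budget("opening hours", delay=0.0))
slow = asyncio.run(answer_within_budget("opening hours", delay=1.0))
print(fast)
print(slow)
```

In a real agent the fallback might instead continue the lookup in the background while the filler plays; this sketch simply cancels it.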
How the chain works
- 01 Speech recognition (STT) · STT
Transcribes speech as a stream while the person is talking, not after the pause. Everything downstream depends on this step.
- 02 Understanding and KB answer · mid model
Detects intent, retrieves the answer from the knowledge base and forms a short reply grounded in the source, without long monologues.
- 03 Action or handoff · rule + model
Files a request through an integration or, by an explicit rule, transfers the call to an operator with the context spoken aloud, not as a blind transfer.
- 04 Speech synthesis (TTS) · TTS
Voices the answer in a natural voice. Playback starts before the reply is fully composed, so the pause doesn't drag.
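Starting playback before the reply is fully composed usually means cutting the model's token stream into speakable chunks as it arrives. The sketch below shows one simple way to do that: buffer tokens and emit each sentence as soon as its end is seen. The function name and the sample tokens are hypothetical, and the sentence splitter is deliberately naive (it would also split after abbreviations such as "e.g.").

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence;
        # the last part may still grow with the next token.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream


tokens = ["We are ", "open 9 to 18. ", "Shall I ", "book a visit?"]
chunks = list(sentences_from_stream(tokens))
for chunk in chunks:
    print("send to TTS:", chunk)
```

With this shape the first sentence can be handed to synthesis while the model is still producing the second, which is what keeps the pause short.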
Integrations
+ any external API
Cost calculator
Estimate at a blended per-token rate (input + output). The exact cost depends on context length, number of calls and the share of manual review; we scope it to your process.
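The arithmetic behind such an estimate can be sketched as tokens per month times a blended rate, plus an optional term for manually reviewed calls. All parameter names and every number below are placeholders chosen for illustration; they are not actual prices or traffic figures.

```python
def estimate_monthly_cost(
    calls: int,
    turns_per_call: int,
    tokens_per_turn: int,
    rate_per_1k_tokens: float,
    review_share: float = 0.0,
    review_cost_per_call: float = 0.0,
) -> float:
    """Rough monthly cost at a blended (input + output) per-token rate.

    tokens_per_turn should already include the context sent with each turn,
    which is why context length drives the result.
    """
    total_tokens = calls * turns_per_call * tokens_per_turn
    model_cost = total_tokens / 1000 * rate_per_1k_tokens
    review_cost = calls * review_share * review_cost_per_call
    return model_cost + review_cost


# Placeholder inputs only.
cost = estimate_monthly_cost(
    calls=1000,
    turns_per_call=6,
    tokens_per_turn=500,
    rate_per_1k_tokens=0.002,
    review_share=0.1,
    review_cost_per_call=0.5,
)
print(f"{cost:.2f}")
```

Doubling the context carried in each turn doubles the model term, while the review term scales only with the number of calls and the share sent to manual review.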