Carbonfay

Voice & telephony

Inbound Voice AI Agent

An inbound voice AI agent takes calls: it transcribes speech, answers from the knowledge base, files a request and transfers to an operator by rule, all within a latency budget and with barge-in support.

A voice line is the most demanding channel for an AI agent: the caller hears every delay and every unnatural pause. The inbound voice agent takes the call, understands the request, answers from the knowledge base and files a request, while routing hard calls to an operator by an explicit rule — but all of that only works if it stays within the latency budget.

What it does

It answers an inbound call, transcribes speech as a stream and detects intent. It retrieves the answer from your knowledge base and voices a short reply in a natural voice. If an action is needed, it files a request through a CRM or telephony integration. If the request is out of scope or the caller asks for a human, it transfers the call to an operator with ready context. Every step is testable and replaceable, not a single “phone-into-model” black box.

Where the line is

In voice the line runs along two axes. The first is the latency budget: the round trip from “the person finished” to “the agent started answering” must fit inside a natural pause, or the conversation falls apart. That’s why streaming transcription, early synthesis start and barge-in are not decoration but the supporting frame. The second is the human-handoff point: two misunderstandings in a row, a money matter, an upset customer or a direct request for a human are signals to transfer the call rather than try to “push through” by voice.
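The latency budget can be checked as plain arithmetic over the stages of one conversational turn. The sketch below is illustrative only: the stage names, every millisecond figure and the 1000 ms ceiling are assumptions, not measurements from this system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: int  # delay this stage adds before the caller hears audio


def time_to_first_audio(stages: list[Stage]) -> int:
    """Sum the per-stage delays between end of speech and first audio."""
    return sum(stage.latency_ms for stage in stages)


def within_budget(stages: list[Stage], budget_ms: int) -> bool:
    """True when the whole turn fits inside the allowed pause."""
    return time_to_first_audio(stages) <= budget_ms


# Hypothetical figures for one turn; replace with measured values.
TURN = [
    Stage("end-of-speech detection", 200),
    Stage("final STT hypothesis", 100),
    Stage("retrieval + first reply tokens", 350),
    Stage("TTS first audio chunk", 150),
]
BUDGET_MS = 1000  # assumed ceiling for a "natural pause"

print(time_to_first_audio(TURN), within_budget(TURN, BUDGET_MS))
```

Keeping the budget as an explicit per-stage table makes it clear which step to lighten when the total drifts over the ceiling.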

The telephony engineering is covered in more detail on the voice AI bots page. Answer retrieval inside the agent is the same RAG system as in text channels, only under a hard time limit per reply.

How the chain works

  1. Speech recognition · STT

    Transcribes speech as a stream while the person is talking, not after the pause. Everything downstream depends on this step.

  2. Understanding and KB answer · mid model

    Detects intent, retrieves the answer from the knowledge base and forms a short reply grounded in the source, without long monologues.

  3. Action or handoff · rule + model

    Files a request through an integration or, by an explicit rule, transfers the call to an operator with the context passed along, not blind.

  4. Speech synthesis · TTS

    Voices the answer in a natural voice. Playback starts before the reply is fully composed, so the pause doesn't drag.
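The chain above can be sketched as independent stages wired together, which is what makes each step testable and replaceable. This is a minimal sketch, not the production pipeline: `handle_turn`, `Reply` and the `fake_*` stand-ins are hypothetical names, and for brevity the stand-in STT consumes the whole audio stream before returning, whereas a real streaming recognizer would emit partial results.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Reply:
    text: str
    handoff: bool = False  # True when the rule routes the call to an operator


# Each stage is a plain callable, so it can be tested or swapped on its own.
Transcriber = Callable[[Iterable[bytes]], str]
Answerer = Callable[[str], Reply]
Synthesizer = Callable[[str], Iterator[bytes]]


def handle_turn(audio: Iterable[bytes], stt: Transcriber,
                answer: Answerer, tts: Synthesizer) -> tuple[Reply, Iterator[bytes]]:
    """Run one turn: speech -> text -> reply -> lazily produced audio."""
    transcript = stt(audio)
    reply = answer(transcript)
    return reply, tts(reply.text)


# Stand-in stages (assumptions for illustration only).
def fake_stt(chunks: Iterable[bytes]) -> str:
    return b"".join(chunks).decode("utf-8")


def fake_answer(text: str) -> Reply:
    if "operator" in text.lower():
        return Reply("Transferring you to an operator.", handoff=True)
    return Reply("Our office is open from nine to six.")


def fake_tts(text: str) -> Iterator[bytes]:
    # A generator: the first chunk is available before the rest is produced.
    for word in text.split():
        yield word.encode("utf-8")
```

Because `fake_tts` is a generator, playback could begin with the first chunk while later chunks are still being produced, which mirrors the early synthesis start described in step 4.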

Integrations

Yandex SpeechKit, GigaChat, Bitrix24, plus any external API

Cost calculator

Cost items: tokens (₽/mo), development (₽), support (₽/mo).

Estimate at a blended per-token rate (input+output). Exact cost depends on context length, number of calls and the share of manual review — we scope it to your process.
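The token part of such an estimate is simple arithmetic at a blended rate. The sketch below assumes a hypothetical breakdown (calls per month, turns per call, tokens per turn) and a made-up rate; none of these figures come from the calculator itself.

```python
def monthly_token_cost(calls_per_month: int, turns_per_call: int,
                       tokens_per_turn: int, rate_per_1k_tokens: float) -> float:
    """Blended (input + output) token cost per month, in the rate's currency."""
    total_tokens = calls_per_month * turns_per_call * tokens_per_turn
    return total_tokens / 1000 * rate_per_1k_tokens


# Hypothetical inputs: 1000 calls, 6 turns each, 1500 tokens per turn,
# at an assumed blended rate of 0.5 rubles per 1000 tokens.
print(monthly_token_cost(1000, 6, 1500, 0.5))  # 4500.0
```

Context length enters through `tokens_per_turn`: a longer retrieved context raises the cost of every turn, which is why the exact figure depends on the process being automated.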

FAQ

Straight answers

Will the delay be noticeable in conversation?
This is the main engineering parameter of a voice agent. We keep the round-trip latency budget — from "you stopped talking" to "the agent started answering" — within a natural pause. It's achieved by streaming transcription, early synthesis start and light models on steps that don't need a heavy one. If the budget can't be met, the honest choice is to skip voice and stay in chat.
What if you interrupt the agent mid-sentence?
The agent supports barge-in: when a person starts speaking, the agent stops and listens rather than finishing its line. Without that, the conversation feels like an answering machine. Barge-in is mandatory, not optional.
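Barge-in can be sketched as a race between playback and a "caller is speaking" signal: whichever finishes first wins, and the loser is cancelled. This is a minimal asyncio sketch under stated assumptions: `play_reply` stands in for real TTS playback, the `asyncio.Event` stands in for a voice-activity detector, and all timings are arbitrary.

```python
import asyncio


async def play_reply(chunks: list[str], played: list[str]) -> None:
    """Stand-in for TTS playback: 'plays' one chunk per tick."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.02)


async def speak_with_barge_in(chunks: list[str],
                              caller_speaking: asyncio.Event) -> tuple[list[str], bool]:
    """Play the reply, but stop as soon as the caller starts talking."""
    played: list[str] = []
    playback = asyncio.create_task(play_reply(chunks, played))
    barge_in = asyncio.create_task(caller_speaking.wait())
    done, _ = await asyncio.wait({playback, barge_in},
                                 return_when=asyncio.FIRST_COMPLETED)
    interrupted = barge_in in done and playback not in done
    # Cancel whichever side lost the race, then let both tasks settle.
    for task in (playback, barge_in):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, barge_in, return_exceptions=True)
    return played, interrupted


async def demo() -> tuple[list[str], bool]:
    caller_speaking = asyncio.Event()
    # Simulate the caller starting to talk shortly after playback begins.
    asyncio.get_running_loop().call_later(0.03, caller_speaking.set)
    reply = ["Our", "office", "is", "open", "from", "nine"]
    return await speak_with_barge_in(reply, caller_speaking)


played, interrupted = asyncio.run(demo())
print(played, interrupted)
```

In the demo the reply is cut off partway and `interrupted` is true; the agent would then hand the microphone back to the recognizer instead of finishing its line.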
When does the call go to a live operator?
By an explicit rule: the agent failed to understand twice in a row, the topic is outside the knowledge base, the customer is upset or asks for a human, or the matter involves money above a threshold. The transfer carries a short summary so the operator doesn't have to re-ask from scratch.
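An explicit rule of this kind fits in a small pure function, which keeps the handoff decision auditable and easy to unit-test. The sketch below is one possible encoding: the `CallState` fields, the reason strings and the default threshold of 10 000 are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallState:
    consecutive_misunderstandings: int = 0
    in_knowledge_base: bool = True
    caller_upset: bool = False
    asked_for_human: bool = False
    amount: float = 0.0  # money involved in the request, if any


def handoff_reason(state: CallState,
                   amount_threshold: float = 10_000.0) -> Optional[str]:
    """Return why the call should go to an operator, or None to continue."""
    if state.asked_for_human:
        return "caller asked for a human"
    if state.consecutive_misunderstandings >= 2:
        return "misunderstood twice in a row"
    if not state.in_knowledge_base:
        return "topic outside the knowledge base"
    if state.caller_upset:
        return "caller is upset"
    if state.amount > amount_threshold:
        return "amount above threshold"
    return None
```

The returned reason can go straight into the summary passed to the operator, so the transfer explains itself.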
Where does the agent get answers and where does it file the request?
Answers come from your knowledge base via retrieval, not the model's memory. The request is filed through an integration with your CRM or telephony, under the same contract operators use: it opens a ticket, records the contact and assigns a task.
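The "retrieval, not memory" principle can be shown with a toy lookup that refuses to answer when nothing in the knowledge base matches. This is a deliberately simplified sketch: it uses word overlap where a real RAG system would use a proper search or embedding index, and the `KB` entries and `min_overlap` threshold are invented for illustration.

```python
import re
from typing import Optional

# Hypothetical knowledge base entries.
KB = {
    "hours": "The office is open on weekdays from 9:00 to 18:00.",
    "delivery": "Delivery takes two to three business days.",
}


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, min_overlap: int = 2) -> Optional[str]:
    """Return the best-matching KB passage, or None if nothing matches well."""
    question_words = tokens(question)
    best_text, best_score = None, 0
    for text in KB.values():
        score = len(question_words & tokens(text))
        if score > best_score:
            best_text, best_score = text, score
    return best_text if best_score >= min_overlap else None


print(retrieve("When is the office open?"))
print(retrieve("Can I pay with crypto?"))  # None: out of scope
```

A `None` result is the "topic is outside the knowledge base" case: the agent has no source to ground a reply in, so the call is a candidate for transfer rather than an improvised answer.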

Next step

Let's design an AI-native automation layer for your operations.
