glossary
Speech Recognition and Synthesis (STT/TTS)
Voice as an input/output channel for an AI system: speech-to-text and text-to-speech, latency budget, barge-in handling.
Speech recognition (Speech-to-Text, STT) turns an audio signal into text; speech synthesis (Text-to-Speech, TTS) turns text into audio. Individually, these are two external models; together they form a voice channel for an AI system, with the same agent running between them as in a text interface. Voice is an input/output channel, not standalone “voice magic”.
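The "same agent between two models" idea can be sketched in a few lines. This is a minimal illustration, not a reference design: `VoiceChannel`, `fake_stt`, `fake_agent`, and `fake_tts` are hypothetical names, and the stand-in functions simply pass bytes and text through where real STT and TTS models would be called.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceChannel:
    """Wraps a text agent with STT on the way in and TTS on the way out."""
    stt: Callable[[bytes], str]    # audio -> text
    agent: Callable[[str], str]    # text -> text, same agent as in a text UI
    tts: Callable[[str], bytes]    # text -> audio

    def handle_turn(self, audio_in: bytes) -> bytes:
        text_in = self.stt(audio_in)
        reply_text = self.agent(text_in)
        return self.tts(reply_text)


# Hypothetical stand-ins: real systems would call external models here.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


channel = VoiceChannel(stt=fake_stt, agent=fake_agent, tts=fake_tts)
reply_audio = channel.handle_turn(b"check my order")
```

Because the agent only ever sees text, swapping the voice channel for a chat window changes nothing in the agent itself.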
What drives voice dialog quality: latency budget (a reply slower than roughly 1.2–1.8 seconds breaks the dialog, so staying within budget requires streaming STT plus partial TTS), robustness to noise and accents on real telephony channels, barge-in handling (when the user starts talking over the bot, the bot must stop and listen), and pause timing (a silence threshold that is too short interrupts the user; one that is too long leaves the dialog hanging). This is a separate layer of logic, not a model setting.
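The turn-taking layer described above can be sketched as a small state machine, assuming a voice-activity detector supplies one speech/no-speech decision per audio frame. All names (`TurnTaker`, `time_to_first_audio_ms`) and all numbers (700 ms silence threshold, 20 ms frames, 1500 ms budget, the per-stage timings) are illustrative assumptions, not measured or recommended values.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


@dataclass
class TurnTaker:
    end_of_turn_silence_ms: int = 700  # assumed pause threshold
    frame_ms: int = 20                 # assumed frame length
    state: State = State.LISTENING
    heard_speech: bool = False
    silence_ms: int = 0

    def on_frame(self, user_is_speaking: bool) -> str:
        """Consume one voice-activity decision and return an action name."""
        if self.state is State.SPEAKING:
            if user_is_speaking:
                # Barge-in: the user talks over the bot, so stop and listen.
                self.state = State.LISTENING
                self.heard_speech = True
                self.silence_ms = 0
                return "stop_tts"
            return "keep_speaking"
        if user_is_speaking:
            self.heard_speech = True
            self.silence_ms = 0
            return "listen"
        if not self.heard_speech:
            return "listen"  # silence before any speech is not a turn end
        self.silence_ms += self.frame_ms
        if self.silence_ms >= self.end_of_turn_silence_ms:
            # The pause is long enough to treat the user's turn as finished.
            self.state = State.SPEAKING
            self.heard_speech = False
            self.silence_ms = 0
            return "start_reply"
        return "listen"


def time_to_first_audio_ms(pause_ms: int, stt_final_ms: int,
                           agent_first_token_ms: int,
                           tts_first_chunk_ms: int) -> int:
    """Perceived delay: end of user speech to the first audible reply."""
    return pause_ms + stt_final_ms + agent_first_token_ms + tts_first_chunk_ms


LATENCY_BUDGET_MS = 1500  # assumed; inside the ~1.2-1.8 s range above

turn = TurnTaker(end_of_turn_silence_ms=60, frame_ms=20)
frames = [True, True, False, False, False, False, True]
actions = [turn.on_frame(f) for f in frames]

total_ms = time_to_first_audio_ms(700, 150, 400, 200)
within_budget = total_ms <= LATENCY_BUDGET_MS
```

With the short 60 ms threshold used in this demo, the third silent frame triggers `start_reply`, and the final speech frame during playback triggers `stop_tts`. The latency helper shows why the pause threshold matters: it is counted in full against the budget before any model has produced output.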
Where it works: voice AI bots, inbound service, outbound dialing, voice steps inside corporate AI agents. Voice is justified where the phone is the customer’s main channel; in B2B email and IT support it more often adds friction than removes it.