
How Voice AI Actually Works — No Buzzwords
You call a clinic. A voice picks up. It sounds human. No human was involved. Here's how the four-box pipeline actually works: STT, LLM, TTS, the phone connection, and the latency equation that makes it feel real.
You call a clinic. A voice picks up. It sounds human. It asks how it can help. You say you want to reschedule your appointment. It pulls your records, confirms a new time, sends you an SMS confirmation.
No human was involved.
That's voice AI. Not Alexa. Not Siri. Not "press 1 for billing." A system that handles real phone calls, makes real decisions, and takes real actions.
Here's how it actually works.
The Pipe: Four Boxes, One Direction
Every voice AI call runs through the same basic chain:
```
User speaks
  -> STT (Speech-to-Text)
  -> LLM (Language Model)
  -> TTS (Text-to-Speech)
User hears response
```
That's it. Four components. Let me break down what each one actually does.
Box 1: STT (Speech-to-Text)
Your voice arrives as raw audio — a waveform. The STT model converts that waveform into text.
The important thing: modern STT doesn't wait for you to finish your sentence. It transcribes while you speak. Every ~20 milliseconds, it emits a partial transcript. You say "I want to reschedule my—" and by the time you reach "my," the system has already transcribed "I want to reschedule."
This is called streaming transcription. It's the difference between a 2-second delay and a 200-millisecond one. Batch transcription (wait for silence, then transcribe the whole thing) is what Siri did in 2012. Streaming is what production voice AI does now.
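The event shape is roughly the same across streaming engines: a partial transcript per audio chunk, then a final one at end of speech. A toy sketch (the generator below simulates the engine; real ones emit these events over a websocket):

```python
def stream_partials(words):
    """Simulate an STT engine emitting one partial transcript per audio chunk."""
    partial = []
    for word in words:
        partial.append(word)
        yield {"transcript": " ".join(partial), "is_final": False}
    # End of speech detected: emit the final transcript.
    yield {"transcript": " ".join(partial), "is_final": True}

utterance = "I want to reschedule my appointment".split()
for event in stream_partials(utterance):
    tag = "final  " if event["is_final"] else "partial"
    print(f"{tag}: {event['transcript']}")
```

The downstream pipeline can start acting on any partial; it only has to be careful to reconcile when the final transcript differs from the last partial.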
Who makes STT models:
| Provider | Latency | Cost | Notes |
|---|---|---|---|
| Deepgram Nova-3 | <300ms streaming | $0.0077/min | Production standard. 45+ languages. Medical variant available. |
| Google Cloud STT | ~300ms | $0.006-0.009/min | Strong multilingual. Good for Hindi. |
| Whisper (OpenAI) | Varies | Free (self-hosted) | Open-weight. 680K hours of training data. But hallucinates during silence. |
| AssemblyAI | ~300ms | $0.0025/min (streaming) | Includes sentiment analysis, PII detection bundled. |
Box 2: LLM (Language Model)
The transcript hits the LLM. This is GPT-4o, Claude, Gemini, or any model you choose.
The LLM does two things:
1. Decides what to say. It reads the transcript, understands intent, and generates a response. "I'd like to reschedule" -> "Sure, I can help with that. What day works for you?"
2. Decides what to do. This is the part most people miss. The LLM doesn't just generate text. It can call functions — book an appointment in the calendar, pull patient records from the database, send an SMS, transfer to a human. This is called tool calling (or function calling).
The LLM sees a list of available tools:
```json
[
  {"name": "book_appointment", "params": ["patient_id", "date", "time"]},
  {"name": "get_patient_record", "params": ["phone_number"]},
  {"name": "send_sms", "params": ["phone_number", "message"]},
  {"name": "transfer_to_human", "params": ["reason"]}
]
```
When the patient says "What time is my next appointment?", the LLM doesn't guess. It calls get_patient_record with their phone number, reads the result, and speaks the actual appointment time. When the patient confirms a new time, it calls book_appointment and then send_sms with the confirmation.
The LLM is not just a chatbot. It's a decision engine with hands.
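The control flow is a dispatch loop: if the model returns a tool call, run the real function and feed the result back; if it returns text, send that to TTS. A provider-agnostic sketch (the tool bodies and the LLM output shape here are stand-ins, not any specific vendor's API):

```python
def get_patient_record(phone_number):
    # Stand-in for a real database lookup.
    return {"patient_id": "p123", "next_appointment": "2024-06-12 10:00"}

def book_appointment(patient_id, date, time):
    # Stand-in for a real calendar write.
    return {"status": "booked", "date": date, "time": time}

# Map tool names (as the LLM sees them) to real functions.
TOOLS = {
    "get_patient_record": get_patient_record,
    "book_appointment": book_appointment,
}

def handle_llm_output(output):
    """If the LLM asked for a tool, run it and return the result;
    otherwise the output is text to hand to TTS."""
    if output.get("tool_call"):
        call = output["tool_call"]
        result = TOOLS[call["name"]](**call["params"])
        return {"tool_result": result}
    return {"speak": output["text"]}

# The LLM decides to look up the caller's record:
out = handle_llm_output(
    {"tool_call": {"name": "get_patient_record",
                   "params": {"phone_number": "+15550100"}}}
)
print(out["tool_result"]["next_appointment"])
```

In production the `tool_result` goes back into the model's context so it can phrase the answer ("Your next appointment is June 12th at 10am"), rather than being spoken verbatim.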
The latency number that matters: Time to first token (TTFT). This is how long the LLM takes to start generating its response after receiving the transcript. Target: under 400ms. GPT-4o-mini hits ~200ms. GPT-4o hits ~400ms. Claude Haiku hits ~250ms.
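TTFT is easy to measure yourself: with a streaming response, time the gap between sending the request and receiving the first token. A minimal sketch, with a fake stream standing in for a real model API:

```python
import time

def measure_ttft(token_stream):
    """Return (seconds until first token, the token itself) for a streaming LLM."""
    start = time.monotonic()
    first_token = next(token_stream)
    return time.monotonic() - start, first_token

def fake_llm_stream():
    # Stand-in for a streaming model response; the sleep simulates TTFT.
    time.sleep(0.05)
    yield "Sure,"
    yield " I can help with that."

ttft, token = measure_ttft(fake_llm_stream())
print(f"TTFT: {ttft * 1000:.0f}ms (first token: {token!r})")
```

Note that the generator body only runs when `next()` is called, so the timer correctly captures the model's delay, not setup time.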
Box 3: TTS (Text-to-Speech)
The LLM's text response needs to become audio. TTS does this.
Modern TTS doesn't generate the entire response as one audio file and then play it. It streams. The moment the LLM produces its first sentence, TTS starts converting that sentence to audio. The user hears the first words before the LLM has finished generating the full response.
The metric: Time to First Audio Byte (TTFAB). How quickly TTS produces its first chunk of playable audio.
| Provider | TTFAB | Quality | Notes |
|---|---|---|---|
| ElevenLabs Turbo v2.5 | <75ms | Best in class | 32 languages. Voice cloning. The quality benchmark. |
| Cartesia Sonic 3 | <90ms | Very good | Very low latency. Fine-grained prosody control. |
| Deepgram Aura-2 | ~100ms | Good | 25% cheaper than competitors. Built for agents. |
| Play.ht | ~100ms | Good | Easy integration. |
ElevenLabs is what you choose when the voice needs to sound truly human. Cartesia is what you choose when you want fine-grained control over pacing and prosody and will trade a small quality difference for it.
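The glue between the LLM and TTS is usually a sentence chunker: buffer tokens, and flush each sentence to TTS the moment it closes, so synthesis starts long before the full reply exists. A minimal sketch (the token list simulates a streaming LLM):

```python
import re

def sentences_as_they_complete(token_stream):
    """Buffer LLM tokens and yield each sentence as soon as it closes,
    so TTS can start synthesizing before the full reply is generated."""
    buf = ""
    for token in token_stream:
        buf += token
        # Flush on sentence-ending punctuation followed by whitespace.
        while (m := re.search(r"[.!?]\s+", buf)):
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever remains at end of stream

tokens = ["Sure, I can ", "help with that. ", "What day ", "works for you?"]
for sentence in sentences_as_they_complete(tokens):
    print("-> TTS:", sentence)
```

Real systems refine the split rule (abbreviations, numbers, commas on long clauses), but the shape is the same: the first sentence reaches TTS while the model is still writing the second.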
Box 4: The Phone Connection
All of this runs on a server. But the caller is on a regular phone. Connecting them requires SIP trunking — a protocol that bridges the internet (where your AI lives) and the public phone network (where the caller lives).
```
User's phone
     |
PSTN (telecom network)
     |
SIP Trunk (Twilio / Telnyx / Exotel)
     |
Your voice AI server
```
Twilio is the default. Simple API, massive ecosystem, works everywhere. Telnyx runs its own private network instead of routing over public internet — better call quality, ~25% cheaper, more setup complexity. Exotel is the standard for Indian telephony.
This is commodity infrastructure. You don't build it. You plug into it.
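With Twilio, for example, the plug-in point is a webhook: an inbound call triggers a POST to your server, and you answer with TwiML telling Twilio to stream the call audio to your AI over a websocket (its Media Streams feature). A sketch of the response your webhook would build; the websocket URL is a placeholder:

```python
def incoming_call_twiml(stream_url):
    """Build the TwiML an inbound-call webhook returns to bridge the
    caller's audio into a voice-AI server via a media-stream websocket."""
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="{stream_url}" />
  </Connect>
</Response>"""

# Placeholder endpoint, not a real server:
print(incoming_call_twiml("wss://your-voice-ai.example.com/media"))
```

From that point on, raw call audio arrives at your server as websocket messages, and everything above (STT, LLM, TTS) runs against that stream.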
The Latency Equation
Here's why every millisecond in each box matters.
In a naive pipeline, each stage runs sequentially:
```
STT finishes (300ms)
  -> LLM finishes (800ms)
  -> TTS finishes (500ms)
Total: 1,600ms
```
1.6 seconds of silence after the user stops talking. That feels like a broken phone call. Humans tolerate 200-500ms of natural pause. Beyond 800ms, it feels robotic. Beyond 1.5 seconds, they hang up.
The fix: streaming everything.
In a production pipeline, the three stages overlap:
- STT emits partial transcripts every 20ms while the user is still speaking
- LLM starts generating on the partial transcript before the user finishes
- TTS starts synthesizing the first sentence before the LLM finishes the full response
- The user hears audio before the pipeline has even completed
| Stage | Production Target |
|---|---|
| STT streaming | <300ms |
| LLM time to first token | <400ms |
| TTS time to first audio | <200ms |
| Total end-to-end | <800ms |
Under 800ms feels acceptable. Under 500ms feels natural.
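The arithmetic behind the two pipelines, using the illustrative numbers from this section (not measurements):

```python
# Naive: each stage runs to completion before the next starts.
naive_ms = 300 + 800 + 500
print(f"naive (sequential): {naive_ms}ms of silence")

# Streamed: the perceived delay is the first-token / first-byte path,
# not the sum of full stage durations.
streamed_ms = 300 + 400 + 200  # STT endpoint + LLM TTFT + TTS TTFAB budgets
print(f"streamed budgets:   {streamed_ms}ms before overlap")
# In practice the stages overlap further (the LLM starts on partial
# transcripts while the user is still speaking), which is how
# production systems get under the 800ms end-to-end target.
```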
The One-Line Summary
That's the pipe. STT converts voice to text. LLM decides what to say and do. TTS converts text back to voice. SIP connects it to a phone.
These four components are commodities. You can swap any of them without changing the others. Deepgram for Whisper. GPT-4o for Claude. ElevenLabs for Cartesia.
The hard part is everything around these four boxes.
That's the next post.
This is post 1 of a 4-part deep dive on voice AI engineering. Each post goes one layer deeper. Next: the orchestration layer — the part nobody talks about.