
The Part of Voice AI Nobody Talks About
The pipeline is commodity. You can build it in a weekend. The orchestration layer — VAD, turn-taking, barge-in, state management — is the actual product. It's also the part nobody writes about.
Last post I broke down the voice AI pipeline: STT -> LLM -> TTS. Three boxes, one direction.
Everyone understood the pipe.
But the pipe is not what makes voice AI hard. You can stitch Deepgram + GPT-4o + ElevenLabs together in a weekend. It'll work in a demo. It'll fail in production.
What sits between and around those three boxes — the orchestration layer — is the actual product. It's also the part nobody writes about, because it's unglamorous and deeply technical.
Let's fix that.
VAD: When Did You Stop Talking?
Before any transcript reaches the LLM, the system needs to answer one question: is the user speaking right now, or not?
This is Voice Activity Detection (VAD). It sounds trivial. It's not.
VAD runs as a state machine with four states:
QUIET -> No audio detected
STARTING -> Speech onset detected, awaiting confirmation (~50ms)
SPEAKING -> Confirmed speech, streaming audio to STT
STOPPING -> Speech ending, holding for ~800ms before triggering LLM
The transitions are where everything breaks.
SPEAKING -> STOPPING is the critical one. When the user pauses, how long do you wait before deciding they're done? Too short, you cut them off mid-sentence. Too long, you create dead air.
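A minimal sketch of this state machine, assuming frame-level speech probabilities from something like Silero; the frame size, confirmation window, and hold time here are illustrative defaults, not tuned production values:

```python
from enum import Enum, auto

class VADState(Enum):
    QUIET = auto()
    STARTING = auto()
    SPEAKING = auto()
    STOPPING = auto()

class EndpointVAD:
    """Toy VAD state machine driven by a per-frame speech probability."""

    def __init__(self, frame_ms=25, confirm_ms=50, stop_hold_ms=800, threshold=0.5):
        self.state = VADState.QUIET
        self.frame_ms = frame_ms
        self.confirm_ms = confirm_ms      # onset confirmation window
        self.stop_hold_ms = stop_hold_ms  # silence hold before "user is done"
        self.threshold = threshold
        self._timer_ms = 0

    def feed(self, speech_prob):
        """Advance one audio frame; returns the new state."""
        is_speech = speech_prob >= self.threshold
        if self.state == VADState.QUIET and is_speech:
            self.state, self._timer_ms = VADState.STARTING, 0
        elif self.state == VADState.STARTING:
            if not is_speech:
                self.state = VADState.QUIET           # false onset, discard
            else:
                self._timer_ms += self.frame_ms
                if self._timer_ms >= self.confirm_ms:
                    self.state = VADState.SPEAKING    # confirmed: stream to STT
        elif self.state == VADState.SPEAKING and not is_speech:
            self.state, self._timer_ms = VADState.STOPPING, 0
        elif self.state == VADState.STOPPING:
            if is_speech:
                self.state = VADState.SPEAKING        # they were not done
            else:
                self._timer_ms += self.frame_ms
                if self._timer_ms >= self.stop_hold_ms:
                    self.state = VADState.QUIET       # end of turn: trigger LLM
        return self.state
```

Note the STOPPING -> SPEAKING transition: a mid-thought pause must be able to cancel the endpoint, which is exactly the transition that language-specific tuning adjusts.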
English speakers tend to have clean sentence boundaries. Hindi speakers pause mid-thought — a 600ms pause doesn't mean the sentence is over. Gujarati speakers use filler sounds ("haan," "to") that aren't words but aren't silence either. Elderly callers speak slower with longer pauses between phrases.
The open-source standard is Silero VAD — a small neural network that classifies audio frames as speech or non-speech. It's fast (~2ms per frame), lightweight, and accurate on clean English audio.
Production voice AI companies spend months tuning VAD for their target population. This single component — not the LLM, not the TTS — is often the difference between "this feels like talking to a human" and "this keeps cutting me off."
Turn-Taking: Who Speaks When?
In human conversation, turn-taking is instinctive. You know when someone is done. You know when to jump in. You know when someone is thinking vs when they've finished.
AI has none of these instincts.
Turn-taking in voice AI is a set of rules layered on top of VAD:
Rule 1: VAD says the user stopped. But did they?
The system waits a threshold (typically 600-1200ms) after VAD transitions to STOPPING. If no new speech starts, it triggers the LLM response. This threshold is the single most important tunable parameter in voice AI. Too low = cut-offs. Too high = dead air.
Some systems use dynamic thresholds — shorter after a yes/no question, longer after an open-ended question. The LLM itself can signal expected response complexity, and the turn-taking logic adjusts the wait time accordingly.
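A sketch of that dynamic-threshold idea; the tags and durations are hypothetical (real values are tuned per deployment), but the shape is just a lookup keyed by the expected reply type the LLM emits:

```python
# Hypothetical endpoint thresholds keyed by the reply type the LLM expects.
ENDPOINT_THRESHOLDS_MS = {
    "yes_no": 600,       # short wait after "Is that correct?"
    "open_ended": 1200,  # longer wait after "Tell me what happened"
    "default": 800,
}

def endpoint_threshold_ms(expected_reply: str) -> int:
    """How long to wait in STOPPING before declaring the turn over."""
    return ENDPOINT_THRESHOLDS_MS.get(expected_reply, ENDPOINT_THRESHOLDS_MS["default"])
```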
Rule 2: Emotional silence is not a prompt to respond.
If a patient just received bad medical news — a failed IVF cycle, a concerning test result — and goes silent for 3 seconds, that silence is not an invitation to speak. It's grief. Processing. Shock.
A generic voice agent interprets 3 seconds of silence as "the user is done" and jumps in with "Would you like to schedule a follow-up?" This is where voice AI fails catastrophically in healthcare.
The fix is not simple. It requires the LLM to tag the emotional state of the conversation. If the last message delivered bad news, the turn-taking logic extends the silence threshold — sometimes to 5-8 seconds. Some systems play a soft backchannel ("I understand this is a lot to take in...") instead of proceeding with the next agenda item.
This is the kind of domain-specific engineering that separates "voice AI that works" from "voice AI that works in healthcare."
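A hedged sketch of that logic, assuming the LLM tags each assistant turn with an emotion label; the labels and multipliers are illustrative, not clinical guidance:

```python
# Illustrative emotion tags and multipliers; a real system tunes these
# with clinicians, not in code review.
BASE_THRESHOLD_MS = 800
EMOTION_MULTIPLIERS = {
    "neutral": 1.0,
    "sensitive": 4.0,   # ~3.2s: difficult but routine topics
    "bad_news": 8.0,    # ~6.4s: grief is not an end-of-turn signal
}

def silence_threshold_ms(last_message_emotion: str) -> int:
    """Extend the end-of-turn wait after emotionally heavy AI messages."""
    return int(BASE_THRESHOLD_MS * EMOTION_MULTIPLIERS.get(last_message_emotion, 1.0))
```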
Rule 3: Backchanneling vs interruption.
The user says "mhmm" or "okay" while the AI is speaking. Is that:
- A backchannel (they're listening, keep going)?
- An interruption (they want to speak)?
- An acknowledgment (they understood, move to next point)?
Short, low-energy utterances during AI speech are usually backchannels. Loud, sustained speech during AI output is an interruption. The classifier runs in real-time on the audio stream, parallel to everything else.
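A heuristic sketch of that three-way call, assuming we already have the overlap's duration and RMS energy from the audio stream; the thresholds are illustrative, and production systems usually layer a small acoustic model on top:

```python
def classify_overlap(duration_ms: int, rms_energy: float, ai_speaking: bool) -> str:
    """Classify user speech that overlaps AI output. Thresholds are assumptions."""
    if not ai_speaking:
        return "turn"                 # no overlap: a normal user turn
    if duration_ms < 400 and rms_energy < 0.2:
        return "backchannel"          # "mhmm", "okay": keep talking
    if duration_ms >= 400 and rms_energy >= 0.2:
        return "interruption"         # sustained, loud: stop TTS now
    return "ambiguous"                # hold for more audio before deciding
```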
Interruption Handling: The #1 UX Killer
The user starts talking while the AI is mid-sentence.
This happens constantly. In natural conversation, overlapping speech is normal. In voice AI, handling it wrong is the single most common UX failure in production deployments.
The system must execute a precise sequence within 200 milliseconds:
1. DETECT speech onset in the input stream while TTS is active
2. STOP TTS playback immediately (within 100-200ms)
3. CAPTURE what the user is saying (route input to STT)
4. DISCARD the pending LLM response (it's no longer relevant)
5. UPDATE the conversation state to include:
- what the AI had said so far (before interruption)
- what the AI was about to say (discarded)
- what the user said (the interruption)
6. REGENERATE: send the updated context to the LLM for a new response
If step 2 takes longer than 200ms, the user perceives the AI talking over them. If step 4 doesn't happen, the AI finishes its previous thought before addressing the interruption. Both feel broken.
This is called barge-in detection and it's the reason production voice AI sounds different from demo voice AI.
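The six-step sequence reads naturally as one coroutine. The stubs below stand in for real TTS playback, STT streams, and LLM clients; their method names are illustrative, not any framework's API:

```python
import asyncio
from dataclasses import dataclass, field

# Minimal stubs so the sequence is runnable; real systems wire these to
# actual TTS playback, STT streams, and LLM clients.
class StubTTS:
    async def stop(self):                       # steps 1-2: halt playback now
        return "Your appointment is on", "March 22 at 10 AM."

class StubSTT:
    async def capture_turn(self):               # step 3: transcribe the barge-in
        return "Wait, I need to change the date."

class StubLLM:
    def __init__(self): self.cancelled = False
    def cancel_pending(self): self.cancelled = True            # step 4
    async def respond(self, state):                            # step 6
        return "Of course, what date works for you?"

@dataclass
class SessionState:
    turns: list = field(default_factory=list)
    def append(self, **turn): self.turns.append(turn)

async def handle_barge_in(tts, stt, llm, state):
    """The six-step barge-in sequence from the text, as one coroutine."""
    spoken, discarded = await tts.stop()        # 1-2. detect onset + stop TTS
    user_text = await stt.capture_turn()        # 3. capture what the user says
    llm.cancel_pending()                        # 4. discard the pending reply
    state.append(role="assistant", text=spoken, # 5. record both sides of the
                 interrupted=True,              #    interrupted turn
                 discarded=discarded)
    state.append(role="user", text=user_text)
    return await llm.respond(state)             # 6. regenerate with new context
```

In production the hard part is not this sequence but its deadline: step 1 through step 2 must complete inside the 200ms budget while audio keeps flowing.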
The Streaming Overlap: How Latency Actually Disappears
I mentioned streaming in the last post. Let me show you exactly how it works.
Naive (sequential) pipeline:
Time 0ms: User stops speaking
Time 300ms: STT finishes transcribing
Time 700ms: LLM finishes generating response
Time 1100ms: TTS finishes synthesizing audio
Time 1100ms: User hears first word of response
1.1 seconds of dead air. Unacceptable.
Production (streaming) pipeline:
Time -200ms: User is still speaking; STT has already emitted a partial transcript
Time 0ms: User stops speaking
Time 50ms: STT emits final transcript
Time 200ms: LLM emits first token of response
Time 250ms: TTS receives first sentence fragment
Time 350ms: TTS emits first audio bytes
Time 350ms: User hears first word of response
350ms. Less than a third of the sequential version. And the user was hearing natural silence (their own pause) for most of it.
The key insight: the three stages aren't sequential. They're concurrent streams that overlap in time. STT is still transcribing while the LLM is already generating. TTS is synthesizing the first sentence while the LLM is still producing the third.
The total perceived latency is the sum of each stage's time-to-first-output, not the sum of their full processing times.
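The overlap structure is just three concurrent stages joined by queues. A toy version with asyncio, assuming stand-in stages instead of real audio or models, shows the shape:

```python
import asyncio

async def mic():
    """Stand-in audio source emitting two 'frames'."""
    for frame in ["hello", "world"]:
        await asyncio.sleep(0)            # yield control, like an audio callback
        yield frame

async def stt(frames, out):
    async for frame in frames:
        await out.put(frame.upper())      # partial transcripts, not finals
    await out.put(None)                   # end-of-stream sentinel

async def llm(inp, out):
    while (text := await inp.get()) is not None:
        await out.put(f"reply({text})")   # streamed tokens
    await out.put(None)

async def tts(inp, played):
    while (tok := await inp.get()) is not None:
        played.append(f"audio[{tok}]")    # first bytes play immediately

async def run_pipeline():
    q1, q2, played = asyncio.Queue(), asyncio.Queue(), []
    # All three stages run concurrently; no stage waits for another to finish.
    await asyncio.gather(stt(mic(), q1), llm(q1, q2), tts(q2, played))
    return played
```

Each stage forwards partial results downstream the moment they exist, which is exactly what collapses perceived latency to the startup costs alone.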
Backchanneling: The 50% Latency Hack
Even with streaming overlap, the LLM needs ~200-400ms to produce its first token. During that window, the user hears nothing.
400ms of silence after you stop speaking feels short on paper. On a phone call, it feels like the other person didn't hear you. You start to repeat yourself. You say "hello?" The call is derailing.
Backchanneling fills this gap. The system plays short, pre-synthesized sounds during LLM processing time:
- "Hmm..."
- "Right..."
- "Okay..."
- "Achha..." (Hindi)
- "Haan..." (Hindi/Gujarati)
These are cached as audio — no TTS latency. The system triggers them the instant VAD detects the user has stopped speaking, before the LLM has started generating.
The result: the user perceives a listening human, not a computing machine.
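One way to sketch the cache, with hypothetical clip names keyed by locale; in production these would be raw audio buffers held in memory, not filenames:

```python
import random

# Hypothetical pre-synthesized backchannel clips, keyed by locale.
BACKCHANNELS = {
    "en": ["hmm.wav", "right.wav", "okay.wav"],
    "hi": ["achha.wav", "haan.wav"],
}

def pick_backchannel(locale: str) -> str:
    """Return a cached clip to play the instant VAD says the user stopped."""
    clips = BACKCHANNELS.get(locale, BACKCHANNELS["en"])
    return random.choice(clips)           # vary the clip so it never sounds robotic
```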
State Management: The Memory Problem
A phone call is not a single exchange. It's a 5-30 minute conversation with context that builds.
The patient says their name at minute 1. At minute 15, the agent needs to use it. The patient mentions they're allergic to a medication at minute 3. At minute 20, when discussing prescriptions, the agent must remember.
The LLM has a context window — a fixed amount of text it can "see" at once. For a short call, the entire transcript fits. For a 30-minute call, it doesn't. The LLM starts losing the beginning of the conversation. It forgets the patient's name. It asks for information already provided. The patient hangs up.
The fix: external state.
Instead of relying on the LLM's context window to hold everything, the orchestrator extracts key entities into a structured session object:
{
  "patient_name": "Priya Sharma",
  "phone": "+91-9876543210",
  "reason_for_call": "reschedule Day 12 monitoring",
  "allergies": ["sulfa drugs"],
  "current_mood": "neutral",
  "appointment_status": "rescheduled to March 22, 10 AM",
  "escalation_needed": false,
  "call_start": "2026-03-20T14:30:00Z",
  "key_facts_discussed": [
    "confirmed new appointment time",
    "reminded about fasting before blood work"
  ]
}
This object lives outside the LLM. On every turn, the orchestrator injects it into the system prompt. The LLM sees the current state plus the last few exchanges, not the entire 30-minute transcript. The conversation is ephemeral. The facts are persistent.
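A sketch of that per-turn assembly, assuming an OpenAI-style message list; the prompt wording and the six-exchange tail are illustrative choices:

```python
import json

def build_prompt(session: dict, transcript: list, last_n: int = 6) -> list:
    """Assemble per-turn context: structured state + a short transcript tail,
    instead of the full 30-minute conversation."""
    system = (
        "You are a clinic voice assistant.\n"
        "Current session state (authoritative; trust this over memory):\n"
        + json.dumps(session, indent=2)
    )
    return [{"role": "system", "content": system}] + transcript[-last_n:]
```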
Fallback Logic: When AI Should Give Up
Voice AI is not a replacement for humans. It's a filter. It handles the 80% of calls that are routine so humans can focus on the 20% that need judgment, empathy, or clinical expertise.
The orchestrator runs a continuous classifier on every exchange:
Is this a routing call? -> Stay (AI handles)
Is this a medical symptom? -> Transfer to nurse
Is this emotional distress? -> Transfer to counselor
Is this a medication dosage question? -> Transfer to pharmacist
Is this a billing dispute? -> Transfer to billing team
Does the patient say "let me talk to a person"? -> Transfer immediately
Has the AI failed to resolve in 3 attempts? -> Transfer with context
The transfer is not just "hold please, connecting you." The orchestrator passes the full session state (structured object above) to the human agent's screen. The human sees the patient's name, reason for call, what was discussed, and what went wrong — before they say hello.
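The routing rules above reduce to a lookup plus two overrides that always win. The intent labels and destinations here are assumptions; in production the intent comes from the LLM or a dedicated classifier:

```python
# Hypothetical routing table; None means the AI keeps handling the call.
ROUTES = {
    "routing": None,
    "medical_symptom": "nurse",
    "emotional_distress": "counselor",
    "dosage_question": "pharmacist",
    "billing_dispute": "billing",
}

def route(intent: str, asked_for_human: bool, failed_attempts: int):
    if asked_for_human:
        return "human_agent"          # explicit request: transfer immediately
    if failed_attempts >= 3:
        return "human_agent"          # AI could not resolve: escalate with context
    return ROUTES.get(intent, "human_agent")  # unknown intent: fail safe to a human
```

Note the last line: an intent the classifier has never seen routes to a human, never to the AI. In healthcare, the safe default is always escalation.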
Concurrency: 50 Calls at Once
Everything above describes one call. Production systems handle 50, 500, or 5,000 simultaneously.
Each call is an independent state machine. Its own VAD instance. Its own STT stream. Its own LLM context. Its own session state. Its own position in the conversation.
This is distributed systems engineering:
- Process isolation: One call crashing must not affect others
- Resource management: 50 concurrent STT streams, 50 concurrent LLM requests, 50 concurrent TTS streams
- Queue management: When the LLM API is slow on one call, don't block the others
- State persistence: If the server restarts mid-call, can the call resume?
- Load balancing: Distribute calls across multiple servers as volume scales
At 50 concurrent calls, you need to think about LLM API rate limits. At 500, you need to think about GPU provisioning for self-hosted models. At 5,000, you need edge computing — processing audio at the nearest data center to minimize physical network latency.
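A minimal sketch of the process-isolation requirement with asyncio, where each call is its own task and one crash is contained rather than fatal (the crash condition is simulated):

```python
import asyncio

async def run_call(call_id: str) -> str:
    """Stand-in for one call's full state machine (VAD, STT, LLM, TTS)."""
    if call_id == "call-13":
        raise RuntimeError("pipeline crashed")   # simulate one bad call
    await asyncio.sleep(0)
    return f"{call_id}: completed"

async def run_all(call_ids):
    # Each call is an isolated task; return_exceptions=True keeps one
    # crashing call from taking down the other 49.
    tasks = [run_call(cid) for cid in call_ids]
    return await asyncio.gather(*tasks, return_exceptions=True)
```

Real deployments push this further, into separate processes or containers, but the contract is the same: a failure is a value to handle, not an exception that propagates across calls.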
This is why voice AI companies are, fundamentally, infrastructure companies. The AI model is a component. The orchestration, state management, and scale engineering are the business.
The Summary
The pipe (STT -> LLM -> TTS) is commodity. You can build it in a weekend.
The orchestration layer is the product:
- VAD that knows when Hindi speakers are pausing vs done
- Turn-taking that respects emotional silence
- Barge-in handling within 200ms
- Streaming overlap that makes 3 sequential stages feel instantaneous
- Backchanneling that fills LLM latency with culturally appropriate sounds
- External state management that survives 30-minute calls
- Fallback logic that's aggressive about routing to humans
- Concurrency that scales to thousands of simultaneous conversations
None of this is in the three boxes. All of it is in the spaces between them.
Next post: how this actually gets built. Pipecat's frame pipeline, LiveKit's infrastructure model, or custom; the architecture decisions; and why a production Indian healthcare stack looks different from a demo.
This is post 2 of a 4-part deep dive on voice AI engineering.