
7 Ways Voice AI Fails in Production (And How to Fix Each One)
A voice AI demo is impressive. A voice AI in production is humbling. Seven failure modes that only appear when real people make real phone calls in real environments — and the engineering fixes for each.
A voice AI demo is impressive. A voice AI in production is humbling.
The gap between the two is not compute, not models, not funding. It's the failure modes that only appear when real people make real phone calls in real environments.
Three posts ago I broke down the pipeline. Then the orchestration layer. Then the frameworks. This post is about what breaks when you deploy all of that into the world.
Seven failure modes. Each one includes: what breaks, why, how to detect it, how to fix it, and what it costs.
1. Barge-In Failure — The AI Talks Over You
What breaks: The user starts speaking while the AI is mid-sentence. The AI doesn't stop. It keeps talking over the user for 1-3 seconds before eventually detecting the interruption.
Why: Barge-in detection requires identifying speech onset in the input audio while TTS audio is being played on the output. The system is listening to itself talk while trying to hear the user. Echo cancellation must subtract the AI's own audio from the input stream before VAD can classify what remains as speech.
If echo cancellation is imperfect (which it always is on phone calls), the AI's own voice leaks into the input stream. VAD classifies this leak as speech. The system thinks the user is talking when nobody is. Or worse: the leak masks real user speech, so the system doesn't hear the actual interruption.
How to detect: Monitor two metrics:
- Overlap duration: How many milliseconds of simultaneous speech (user + AI) occur before TTS stops. Target: <200ms. If overlap is regularly >500ms, barge-in is broken.
- False barge-in rate: How often TTS stops when the user hasn't actually spoken (echo leak triggered it). Target: <5%.
How to fix:
- Better echo cancellation. WebRTC has built-in AEC (Acoustic Echo Cancellation). For PSTN calls, you may need a dedicated AEC processor before VAD.
- Separate the detection threshold for "user speaking while AI is silent" vs "user speaking while AI is talking." The latter needs higher confidence before triggering.
- Use energy-based classification alongside VAD. Real speech has different energy patterns than echo leak. A quick FFT analysis can distinguish them.
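As a rough sketch of the dual-threshold idea (function name, thresholds, and the RMS-energy stand-in for a VAD score are all illustrative, not from any particular framework):

```python
import math

def should_barge_in(frame, tts_playing,
                    quiet_threshold=0.01, talking_threshold=0.05):
    """Classify an input audio frame as a real user interruption.

    RMS energy stands in for a proper VAD score here. While the AI's TTS
    is playing, echo leak raises the input noise floor, so a higher bar
    is required before treating the frame as genuine user speech.
    """
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    threshold = talking_threshold if tts_playing else quiet_threshold
    return rms > threshold
```

The point is the asymmetry: the same frame that counts as speech while the AI is silent is treated as probable echo leak while the AI is talking.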
Cost to fix: 1-2 weeks of audio engineering. The hardest part is getting test data — you need recordings of real calls with real echo patterns, not lab recordings.
2. Latency Spikes — The AI Goes Silent
What breaks: The user finishes speaking. Nothing happens for 2-3 seconds. Then the AI responds as if nothing happened.
Why: The LLM is the bottleneck. It accounts for 40-60% of total pipeline latency. And LLM APIs have variable latency. GPT-4o might respond in 300ms on one call and 1,200ms on the next, depending on server load, prompt length, and output complexity.
The tail latency (the worst 5% of responses) is what kills the experience. Average latency can be great while 1 in 20 responses creates an awkward 2-second pause.
How to fix:
- Backchanneling. Fill LLM thinking time with filler sounds. This is mitigation, not a fix, but it works. (Covered in detail in post 2.)
- Smaller models for routine turns. Use GPT-4o-mini or Claude Haiku for simple exchanges (confirmations, scheduling). Reserve GPT-4o for complex reasoning (medical triage, multi-step decisions). Dynamic model selection based on conversation state.
- Speculative generation. For predictable conversation flows (appointment confirmation), pre-generate the likely response before the user finishes speaking. If the prediction is right, response time is near zero. If wrong, fall back to normal generation.
- TTS caching. 70-80% of sentences repeat across calls. Cache the audio. Cached response latency: <100ms, regardless of LLM or TTS speed.
- LLM provider redundancy. If GPT-4o is slow, fall back to Claude or Gemini. Route to whichever provider has the lowest current latency. This requires maintaining context compatibility across providers.
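The TTS caching idea is the simplest of these to sketch. A minimal version (class and key scheme are illustrative; a production cache would also bound its size and key on voice/model settings):

```python
import hashlib

class TTSCache:
    """Cache synthesized audio keyed by normalized sentence text, so
    repeated sentences skip the TTS round-trip entirely."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(text):
        # Normalize case and whitespace so trivially different renderings
        # of the same sentence share one cache entry.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_synthesize(self, text, synthesize_fn):
        key = self._key(text)
        if key not in self._store:
            # Cache miss: pay the full TTS latency once.
            self._store[key] = synthesize_fn(text)
        return self._store[key]
```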
Cost to fix: Backchanneling takes 1-2 days. Dynamic model selection takes a week. Full provider redundancy takes 2-4 weeks.
3. Context Loss — The AI Forgets Your Name
What breaks: 15 minutes into a call, the patient mentions something they said at minute 2. The AI has no memory of it. It asks the patient to repeat information. The patient gets frustrated and hangs up.
Why: LLM context windows are finite. Even large context models (128K tokens) fill up on a verbose 30-minute call. But the real issue isn't size — it's attention degradation. LLMs pay less attention to information in the middle of a long context. The beginning and end get disproportionate weight. Your name at minute 2 might be in the context window but functionally invisible.
How to detect: Audit call transcripts for repeated questions. If the AI asks "What was your name again?" or "Can you remind me of your appointment date?" more than once per call, context management is broken.
How to fix:
- Entity extraction to external state. After every 3-5 exchanges, run a secondary LLM call (fast, cheap model) to extract key entities: name, appointment, reason for call, allergies, medication, mood. Store in a structured session object outside the context.
- Rolling context window. Instead of sending the entire transcript, send: system prompt + session state object + last N exchanges (where N = 5-10). The session object provides continuity. The recent exchanges provide conversational flow.
- Summarization checkpoints. Every 5 minutes, summarize the conversation so far into a 2-3 sentence summary. Replace old context with the summary. This compresses a 15-minute transcript into a paragraph while preserving key facts.
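Combining the first two fixes, prompt assembly might look like this (a sketch; the message shapes and the `session_state` dict are assumptions, not a specific framework's API):

```python
def build_prompt(system_prompt, session_state, transcript, last_n=5):
    """Assemble LLM input from the system prompt, the structured session
    state, and only the last N exchanges, instead of the full transcript.

    session_state is the externally maintained entity store (name,
    appointment, reason for call, ...); transcript is a list of
    {"role": ..., "content": ...} chat messages.
    """
    facts = "\n".join(f"- {k}: {v}" for k, v in session_state.items())
    system = f"{system_prompt}\n\nKnown facts about this call:\n{facts}"
    return [{"role": "system", "content": system}] + transcript[-last_n:]
```

Because the extracted entities ride along in the system message on every turn, the name mentioned at minute 2 can never fall out of the window or get lost in the middle of a long context.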
Cost to fix: 1-2 weeks. The entity extraction pipeline needs testing against real call transcripts to ensure it catches all relevant information.
4. Voice Hallucination — The AI Confidently Says Wrong Things
What breaks: The AI confidently states the wrong appointment time. Or the wrong medication name. Or the wrong doctor's name. The patient acts on this information. They show up at the wrong time. They take the wrong dose.
Why: LLMs hallucinate. This is not a bug — it's the mechanism. They predict the most probable next token, and sometimes the most probable token is wrong. In text chat, the user can re-read, question, verify. In voice, the information flows past in real-time. There's no scroll bar. There's no "wait, let me re-read that." The patient heard it, believed it, and moved on.
How to fix:
- Never freestyle on factual information. Appointment times, medication names, dosages, doctor names — these must ALWAYS come from a tool call (database lookup), never from the LLM's memory or generation. Hard rule in the system prompt: "Never state a date, time, medication name, or doctor name without calling the lookup tool first."
- Constrained response templates. For factual responses, use templates with slots: "Your appointment is on {date} at {time} with Dr. {name}." The LLM fills the slots with tool-call results. The surrounding text is fixed. No room for hallucination on the critical data.
- Read-back confirmation. After stating critical information, ask the patient to confirm: "I have you down for Thursday at 2 PM. Does that sound right?" This adds one exchange but catches hallucinations before they cause harm.
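A minimal sketch of the constrained-template approach (the template string comes from the example above; the function name and error handling are illustrative):

```python
APPOINTMENT_TEMPLATE = "Your appointment is on {date} at {time} with Dr. {name}."

def render_confirmation(tool_result):
    """Fill the fixed template with tool-call results only.

    Raises instead of letting anything improvise when a slot is missing,
    so the critical data can never be generated from the LLM's memory.
    """
    required = {"date", "time", "name"}
    missing = required - tool_result.keys()
    if missing:
        raise ValueError(f"tool result missing slots: {sorted(missing)}")
    return APPOINTMENT_TEMPLATE.format(**tool_result)
```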
Cost to fix: 1 week for tool-call enforcement. Template system: 2-3 days. Read-back logic: 1 day.
5. Transcription Hallucination — Whisper Makes Things Up
What breaks: The patient pauses for 5 seconds. During that silence, the STT model generates a transcript of words that were never spoken. The LLM receives this fabricated transcript and responds to it. The patient is confused because the AI is addressing something they never said.
Why: Whisper (OpenAI's open-source STT model) has a documented hallucination problem. During silence or low-energy audio, it occasionally generates plausible but fabricated text. Studies have found hallucination rates as high as 27% in certain conditions (long silences, background noise, non-English audio).
The hallucinated text is often coherent and plausible — it's not random characters. Whisper might generate "I'm having chest pain" during a 10-second silence. In a medical context, this is not a UX problem. It's a safety problem.
How to detect: Cross-reference STT output with VAD state. If VAD says "QUIET" (no speech detected) but STT emitted a transcript, that transcript is likely hallucinated.
How to fix:
- VAD gating. Only pass audio to STT when VAD confirms speech. During silence, STT receives no input and cannot hallucinate. This is the most important fix.
- Use Deepgram Nova-3 instead of Whisper. Deepgram is purpose-built for real-time streaming and has significantly lower hallucination rates. It's also faster for streaming use cases.
- If you must use Whisper: use gpt-4o-mini-transcribe (OpenAI's newer model), which reduced hallucination by 90% compared to original Whisper. Or use Whisper with aggressive silence detection that blanks the audio input during quiet periods.
- Confidence scoring. Some STT providers return confidence scores per word. Low-confidence transcripts during quiet periods should be discarded.
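VAD gating is simple enough to sketch directly. One wrinkle worth showing: keep a short "hangover" of frames after speech ends so word tails aren't clipped (the function, the hangover count, and the toy VAD in the usage note are illustrative):

```python
def vad_gate(frames, vad_is_speech, hangover=3):
    """Forward audio frames to STT only while VAD detects speech.

    During silence, STT receives no input and cannot hallucinate. A few
    extra 'hangover' frames are passed through after speech ends so the
    final syllables of a word aren't cut off mid-frame.
    """
    remaining = 0
    for frame in frames:
        if vad_is_speech(frame):
            remaining = hangover
            yield frame
        elif remaining > 0:
            remaining -= 1
            yield frame
        # else: silence, drop the frame entirely
```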
Cost to fix: Switching to Deepgram: 1-2 days. VAD gating: a few hours. Confidence filtering: 1 day.
6. Emotional Blindness — The AI Can't Hear Tears
What breaks: A patient receives a negative pregnancy test result. She's crying. The AI, working from text transcription alone, sees the words "okay" and proceeds with: "Would you like to schedule your next cycle? We have availability on--"
The patient hangs up. She calls the clinic to complain. The clinic stops using the AI system.
Why: The STT -> LLM -> TTS pipeline (cascading architecture) strips all emotional signals from the audio. STT converts speech to text. Text doesn't carry tone, pitch, speech rate, tremor, or crying. The LLM sees "okay" and treats it as confirmation. It has no idea the person is devastated.
This is the fundamental limitation of the cascading architecture. It's optimized for information extraction, not emotional intelligence.
How to fix:
- Parallel emotion classifier. Run a lightweight audio classification model alongside STT. It doesn't transcribe — it classifies: neutral, distressed, angry, confused, happy. Feed this classification to the LLM as metadata alongside the transcript.
The metadata the LLM receives might look like:
[TRANSCRIPT]: "okay"
[EMOTION]: distressed (confidence: 0.87)
[AUDIO_MARKERS]: speech_tremor, long_pause_before, reduced_volume
The LLM now knows "okay" doesn't mean "okay."
- Conservative escalation. In healthcare, when in doubt, transfer. If the emotion classifier shows distress above a threshold, don't try to handle it. Route to a human immediately with the call context. The AI's job is to be reliable on routine calls, not to be a therapist.
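Both pieces (formatting the emotion metadata and the escalation rule) can be sketched in a few lines; names, the threshold, and the emotion labels are illustrative:

```python
def annotate_transcript(text, emotion, confidence, markers):
    """Format transcript plus emotion metadata for the LLM prompt."""
    return (f'[TRANSCRIPT]: "{text}"\n'
            f"[EMOTION]: {emotion} (confidence: {confidence:.2f})\n"
            f"[AUDIO_MARKERS]: {', '.join(markers)}")

def should_escalate(emotion, confidence, threshold=0.7,
                    concerning=("distressed", "angry")):
    """Conservative rule: route to a human whenever a concerning emotion
    is detected above the confidence threshold."""
    return emotion in concerning and confidence >= threshold
```

The escalation check runs before the LLM ever sees the turn: if it fires, the call transfers with context instead of generating a reply.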
Cost to fix: Parallel emotion classifier: 2-3 weeks (model selection, integration, threshold tuning). Hybrid architecture (switching between cascading and speech-to-speech mid-call): 4-6 weeks. Conservative escalation rules: 2-3 days.
7. Accent and Noise — The AI Can't Hear You
What breaks: A patient calls from a noisy clinic waiting room. They speak Hindi with a Rajasthani accent. The STT model transcribes gibberish. The LLM responds to the gibberish. The conversation spirals.
Or: an elderly patient speaks slowly and softly. The STT model picks up background TV audio more strongly than the patient's voice. The AI responds to the TV dialogue.
Why: STT models are trained primarily on clean, well-recorded American English. Each of these factors degrades accuracy:
- Accent: A Rajasthani accent on Hindi has different phonetic patterns than the standard Hindi in training data
- Code-switching: Indian callers mix Hindi and English mid-sentence ("Mera appointment Thursday ko hai na?"). Models trained on single languages struggle
- Background noise: Clinic waiting rooms, traffic, TV, children. The signal-to-noise ratio drops
- Phone audio quality: PSTN narrowband audio is sampled at 8kHz, capping usable frequencies at roughly 4kHz. Much of the spectral detail STT models were trained on is gone
- Age: Elderly speakers often have reduced volume, slower pace, and more pauses. Models optimize for average adult speech patterns
Stack three of these (elderly + accent + noisy environment) and STT accuracy can drop below 60%. At that point, the voice AI is useless.
How to detect: Word Error Rate (WER) tracking per call. Segment by: caller demographics, language, time of day (proxy for noise level), call duration. Identify which segments have high WER.
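For reference, WER is just word-level edit distance over reference length. A minimal implementation (in production you'd use an evaluation library, but the metric itself is this small):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Compute this per call against human-corrected transcripts, then aggregate by segment to find where accuracy collapses.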
How to fix:
- Deepgram Nova-3 with keyterm prompting. Deepgram's multilingual models handle Hindi, Gujarati, and code-switching significantly better than Whisper. Keyterm prompting for domain vocabulary ("IVF", "beta-HCG", "embryo") improves accuracy on the words that matter most.
- Noise suppression pre-processing. Run audio through a noise suppression model before STT. Products like Krisp (SDK available) or open-source models like Facebook's Demucs can isolate voice from background noise. This adds 20-30ms of latency but can improve WER by 30-40% in noisy environments.
- Confirmation loops for low-confidence transcripts. If STT confidence is below a threshold on a critical piece of information (name, date, medication), the AI asks for confirmation: "I heard Thursday at 3 PM. Is that correct?" This adds friction but prevents errors.
- Language detection and model switching. Detect the caller's primary language in the first 3-5 seconds. Switch to a language-optimized STT model. For Hindi, Deepgram or Google Cloud STT. For Gujarati, fewer options exist — Google Cloud has decent coverage, but accuracy testing is essential before deployment.
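The confirmation-loop fix reduces to two small functions (field names, the threshold, and the prompt wording are illustrative):

```python
def needs_confirmation(field, confidence,
                       critical=("name", "date", "time", "medication"),
                       threshold=0.85):
    """Flag a transcribed value for read-back when it is a critical field
    captured below the STT confidence threshold. Non-critical fields are
    never confirmed, to keep friction low."""
    return field in critical and confidence < threshold

def confirmation_prompt(field, heard_value):
    """Build the read-back question for a low-confidence value."""
    return f"I heard {heard_value} for your {field}. Is that correct?"
```

This way the friction is targeted: a mumbled pleasantry passes through, but a mumbled medication name always triggers a read-back.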
Cost to fix: Noise suppression integration: 1-2 days. Confidence-based confirmation: 3-5 days. Population-specific STT testing: 1-2 weeks of data collection and analysis.
The Meta-Lesson
Every failure mode on this list has the same root cause: the gap between demo conditions and production conditions.
Demos use clean audio, single language, short conversations, simple intents. Production has noisy audio, mixed languages, 30-minute calls, emotionally complex situations, elderly callers, background TV, and a patient who just received devastating news.
The pipeline (STT -> LLM -> TTS) handles demos beautifully.
The orchestration layer (VAD, turn-taking, state management) handles production.
The failure mode engineering (barge-in, latency, context loss, hallucination, emotion, accent, noise) is what separates companies that demo well from companies that deploy successfully.
Every fix on this list is unglamorous. Echo cancellation. VAD gating. TTS caching. Confidence thresholds. Noise suppression. Entity extraction. Escalation rules.
None of it is revolutionary. All of it is necessary.
The best voice AI system in the world is the one where the patient doesn't notice it's AI. That invisibility is the product of a hundred small engineering decisions, each one preventing a specific failure mode.
This completes the 4-part voice AI deep dive. The pipeline. The orchestration. The frameworks. The failures. Each post went one layer deeper than the last. Next block: How LLMs Actually Work.