
Pipecat, LiveKit, or Custom — How Voice AI Gets Built
Three real options exist for building voice AI. Each makes different tradeoffs. Pipecat's frame pipeline, LiveKit's infrastructure model, and what a production Indian healthcare stack actually looks like.
Two posts ago: the pipeline (STT -> LLM -> TTS). Last post: the orchestration layer (VAD, turn-taking, barge-in, state management).
Now: how this actually gets built. What framework you use. What architecture decisions you make before writing a single line of business logic.
Three real options exist. Each makes different tradeoffs. Choosing wrong costs you months.
Option 1: Pipecat — The Open-Source Standard
Pipecat is an open-source Python framework for building real-time voice AI pipelines. Built by Daily.co, a company that provides real-time video and audio infrastructure.
It's what most voice AI startups build on. It's what companies customizing voice AI for specific verticals (healthcare, finance, support) start with. If you're building a voice agent today and you don't have a reason to build custom, you're probably using Pipecat.
The Core Abstraction: Everything Is a Frame
Pipecat models real-time audio processing as a pipeline of frames. A frame is a small chunk of data — audio, text, or a control signal — that flows through a series of processors.
```python
# Simplified Pipecat pipeline
pipeline = Pipeline([
    transport.input(),    # Raw audio frames from the call
    vad_processor,        # Silero VAD: classifies speech/silence
    stt_processor,        # Deepgram: converts audio to text frames
    context_aggregator,   # Collects text into conversation context
    llm_processor,        # GPT-4o: generates response text frames
    tts_processor,        # ElevenLabs: converts text to audio frames
    transport.output(),   # Sends audio frames back to the caller
])
```
Audio arrives as frames — small chunks, roughly 20ms each. Each processor in the pipeline consumes frames, does its work, and emits new frames. The processors run concurrently. When the STT processor emits a text frame, the LLM processor starts working on it immediately, even if STT is still processing the next audio chunk.
This is the abstraction that makes voice AI buildable by small teams. Instead of managing raw WebSocket connections, audio buffers, sample rates, and byte streams, you chain processors and let frames flow.
Frame Types
Not all frames are audio. Pipecat has a type system for frames:
- AudioRawFrame — raw audio data (PCM samples)
- TranscriptionFrame — partial or final transcript from STT
- TextFrame — text chunks from LLM response
- TTSAudioRawFrame — synthesized audio from TTS
- UserStartedSpeakingFrame — control signal from VAD
- UserStoppedSpeakingFrame — control signal from VAD
- LLMFullResponseStartFrame — marks the beginning of LLM output
- EndFrame — signals pipeline shutdown
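The type system can be sketched with plain dataclasses. This is an illustrative sketch, not Pipecat's real class hierarchy (the actual frames carry more fields and inherit through intermediate types):

```python
from dataclasses import dataclass

# Illustrative frame classes, loosely modeled on the list above.
# Names match Pipecat's; the fields are simplified assumptions.

@dataclass
class Frame:
    """Base type for everything that flows through the pipeline."""

@dataclass
class AudioRawFrame(Frame):
    audio: bytes              # raw PCM samples
    sample_rate: int = 16000

@dataclass
class TranscriptionFrame(Frame):
    text: str
    is_final: bool            # partial vs final STT result

@dataclass
class TextFrame(Frame):
    text: str                 # one chunk of the LLM's streamed response

@dataclass
class UserStoppedSpeakingFrame(Frame):
    """Control signal from VAD: no payload, the type is the message."""

@dataclass
class EndFrame(Frame):
    """Signals pipeline shutdown."""

# A processor dispatches on frame type:
def describe(frame: Frame) -> str:
    if isinstance(frame, TranscriptionFrame) and frame.is_final:
        return f"final transcript: {frame.text}"
    return type(frame).__name__
```

Control frames like UserStoppedSpeakingFrame carry no data at all; downstream processors react purely to the frame's type.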
Transport Layer
Pipecat doesn't handle telephony itself. It delegates to a transport — the layer that manages the actual audio connection.
The default transport is Daily.co's WebRTC infrastructure. Your voice agent runs on a server. The caller connects through a Daily.co room. Audio streams bidirectionally over WebRTC.
For phone calls (PSTN), you need a SIP trunk. Pipecat integrates with:
- Twilio — most common, easiest setup
- Telnyx — private network, better quality, cheaper at scale
- Daily.co SIP — native integration, simplest with Pipecat
The SIP trunk bridges the phone network and WebRTC. Caller dials a regular number -> SIP trunk routes to Daily.co -> Daily.co streams to your Pipecat pipeline.
What Pipecat Does NOT Give You
Pipecat gives you plumbing. It does NOT give you:
1. Regional language VAD tuning. Silero VAD is the default. It works well on clean English. It does not handle Hindi pause patterns, Gujarati filler sounds, noisy Indian phone audio, or elderly callers with longer pauses. You must tune thresholds, retrain models, or build a secondary classifier.
2. Backchanneling / filler injection. There's no built-in "play 'hmm' while the LLM is thinking" processor. You build a custom processor that:
- Detects when UserStoppedSpeakingFrame was emitted
- Starts a timer
- If the LLM hasn't emitted its first token within 300ms, plays a cached filler sound
- Stops the filler when LLM audio begins
This is maybe 100 lines of code. But it's 100 lines that reduce perceived latency by 50%.
3. State machines for conversation flow. Pipecat's LLM processor sends the full conversation history on every turn. For a long call, that's expensive and eventually exceeds context limits. You need to build:
- Entity extraction after every few turns
- External state storage
- Dynamic system prompt that changes based on conversation phase
- Token budget management
4. TTS caching. In many domains, 70-80% of sentences repeat across calls. "Your appointment is confirmed for..." "Please arrive 15 minutes before..." Instead of synthesizing these every time, you cache the audio and skip TTS entirely. Sub-100ms response for cached phrases.
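Point 4 can be sketched in a few lines. A minimal dynamic cache, assuming a `synthesize(voice, text)` callable standing in for the real TTS API:

```python
import hashlib

class TTSCache:
    """Dynamic TTS cache as described above: a sentence is synthesized
    normally until it has been seen CACHE_AFTER times, then its audio is
    stored and replayed with near-zero latency. `synthesize` is a
    stand-in for your real TTS call (ElevenLabs, etc.)."""

    CACHE_AFTER = 3

    def __init__(self, synthesize):
        self.synthesize = synthesize
        self.counts = {}   # key -> times this sentence has been seen
        self.audio = {}    # key -> cached audio bytes

    def _key(self, voice: str, text: str) -> str:
        # Key on voice + exact text: a different voice is different audio.
        return hashlib.sha256(f"{voice}|{text}".encode()).hexdigest()

    def get_audio(self, voice: str, text: str) -> bytes:
        key = self._key(voice, text)
        if key in self.audio:
            return self.audio[key]               # cache hit: skip TTS entirely
        self.counts[key] = self.counts.get(key, 0) + 1
        audio = self.synthesize(voice, text)     # cache miss: real synthesis
        if self.counts[key] >= self.CACHE_AFTER:
            self.audio[key] = audio              # hot sentence: cache it
        return audio
```

In production you would back this with Redis and pre-seed the cache with known phrases ("Your appointment is confirmed for...") at deploy time rather than waiting for three occurrences.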
What Real Customization Looks Like
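Here is one of those customizations made concrete: a framework-agnostic asyncio sketch of the filler-injection logic from point 2 above. The real version would be a Pipecat processor pushing a cached audio frame downstream; `play_filler` is a stand-in for that.

```python
import asyncio

FILLER_DELAY_S = 0.30   # play a filler if the LLM is silent this long

class FillerInjector:
    """Sketch of the backchanneling logic described above. Not actual
    Pipecat API: `play_filler` stands in for pushing a cached 'hmm'
    audio frame downstream."""

    def __init__(self, play_filler):
        self.play_filler = play_filler
        self._timer = None

    def on_user_stopped_speaking(self):
        # The user went quiet: start the 300ms countdown.
        self._timer = asyncio.create_task(self._play_after_delay())

    async def _play_after_delay(self):
        await asyncio.sleep(FILLER_DELAY_S)
        self.play_filler()   # LLM still silent: cover the gap

    def on_llm_first_audio(self):
        # LLM audio arrived: cancel the pending filler (a real processor
        # would also fade out any filler already playing).
        if self._timer and not self._timer.done():
            self._timer.cancel()
```

When the LLM responds within 300ms the timer is cancelled and the caller never hears a filler; when it lags, the cached "hmm" fills the silence.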
Option 2: LiveKit — The Infrastructure Play
LiveKit takes a fundamentally different approach. Where Pipecat is a framework (you build pipelines), LiveKit is infrastructure (you build on top of rooms).
LiveKit started as an open-source alternative to Twilio and Daily.co for real-time communication. WebRTC infrastructure — rooms, participants, audio/video routing, recording. Then they added an Agents Framework on top: a Python/Node SDK for building AI participants that join rooms and interact with humans.
Architecture
LiveKit Server (rooms, audio routing, WebRTC, recording)
-> LiveKit Agents Framework (Python/Node SDK)
-> Your agent logic (STT + LLM + TTS + tools)
The LiveKit server handles all transport. Your agent is just another "participant" in a room. It receives audio from the human participant, processes it, and sends audio back.
What LiveKit Gives You That Pipecat Doesn't
Self-hostable infrastructure. Pipecat's default transport depends on Daily.co's cloud for WebRTC. LiveKit's server is open-source. You can run it on your own infrastructure. For healthcare companies with data residency requirements (patient audio cannot leave India/US), this matters.
Built-in room management. Multiple participants in a conversation. Record the call. Stream it. Add a human supervisor who can listen in. Add a second AI agent that handles a different language. All native to the platform.
Conversational turn detection. LiveKit's Agents Framework has built-in turn detection that goes beyond VAD. It considers:
- Audio energy levels
- Speech duration
- Semantic signals (did the LLM's response end with a question?)
- Configurable endpointing thresholds per language
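As a toy illustration of combining those signals, here is a heuristic endpointer with invented per-language thresholds. LiveKit's actual turn detector is model-based; this only shows the shape of the idea:

```python
def should_end_turn(silence_ms: int, speech_ms: int, lang: str = "hi") -> bool:
    """Toy endpointing heuristic. All thresholds are invented for
    illustration; a production turn detector is a trained model."""
    # Per-language silence thresholds: Hindi/Gujarati speakers pause
    # longer mid-sentence than English-tuned defaults assume.
    base_ms = {"hi": 900, "gu": 900, "en": 600}.get(lang, 700)
    if speech_ms < 300:
        # Very short utterance ("haan", "hmm"): likely a backchannel,
        # not a full turn, so demand a longer silence before ending.
        base_ms += 400
    return silence_ms >= base_ms
```

The point is not these numbers but that the threshold is a function of language, utterance length, and context rather than a single global constant.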
Scale primitives. Distribute agents across GPU clusters. Load balance across regions. LiveKit Cloud (their hosted version) handles this. Self-hosted, you manage it but the primitives exist.
What Pipecat Gives You That LiveKit Doesn't
Simpler mental model. Frames in, frames out. No rooms, no participants, no server management. For a single voice agent answering calls, Pipecat is simpler.
Larger processor ecosystem. More pre-built integrations for STT/LLM/TTS providers. Community-contributed processors for specific use cases.
More community examples. Pipecat has been the default for voice AI builders longer. More tutorials, more example code, more Stack Overflow answers.
When to Choose Which
| Scenario | Choose |
|---|---|
| Single voice agent, ship fast | Pipecat |
| Need to self-host everything | LiveKit |
| 10K+ concurrent calls | LiveKit |
| Building a platform (not just agents) | LiveKit |
| Healthcare with data residency | LiveKit |
| Prototype / hackathon | Pipecat |
| Need recording, streaming, multi-participant | LiveKit |
Most startups start with Pipecat. Companies that outgrow it or need infrastructure control migrate to LiveKit or custom.
Option 3: Custom — What Platforms Do at Scale
Bland AI claims 1 million concurrent calls. Retell AI processes 4.2 million real calls for training data. Vapi handles high-frequency trading voice interfaces where microseconds matter.
At these scales, framework overhead matters. Garbage collection pauses matter. Memory allocation patterns matter. Network hop count matters.
These companies build custom orchestration:
Vapi's Predict-and-Scrap
Vapi's proprietary trick: start LLM inference before the user finishes speaking.
User is mid-sentence: "I want to reschedule my appoint--"
System:
1. STT partial: "I want to reschedule my"
2. LLM starts generating on partial transcript
3. LLM begins: "Sure, I can help you reschedule..."
User finishes: "--ment to next Thursday"
System:
4. STT final: "I want to reschedule my appointment to next Thursday"
5. SCRAP the in-progress LLM response
6. RESTART with full transcript
7. LLM generates with complete context: "I'll reschedule to next Thursday..."
Sometimes the prediction is right (the partial transcript was enough). The LLM response is already generating and TTS is already synthesizing. Response time: near zero.
Sometimes the prediction is wrong. The system scraps and restarts. Wasted compute. But even in the restart case, the LLM is "warmed up" — KV cache is partially populated, first tokens arrive faster.
Net result: 200-300ms latency reduction on average. The compute waste is worth it at their scale.
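The scrap-and-restart flow above can be sketched with asyncio task cancellation. A hedged sketch, not Vapi's implementation; `generate` stands in for a streaming LLM call:

```python
import asyncio

class PredictAndScrap:
    """Sketch of the predict-and-scrap pattern described above.
    `generate` is a stand-in for a streaming LLM call: it starts on the
    partial transcript and is scrapped and restarted if the final
    transcript adds new words."""

    def __init__(self, generate):
        self.generate = generate
        self._task = None
        self._prompt = None

    def on_partial_transcript(self, text: str):
        # Speculate: start generating before the user finishes speaking.
        self._start(text)

    async def on_final_transcript(self, text: str) -> str:
        if text != self._prompt:
            self._start(text)    # prediction wrong: scrap and restart
        return await self._task  # prediction right: already in flight

    def _start(self, prompt: str):
        if self._task and not self._task.done():
            self._task.cancel()  # scrap the in-progress response
        self._prompt = prompt
        self._task = asyncio.create_task(self.generate(prompt))
```

In the restart case the wasted task is pure compute cost, but as noted above, a warmed KV cache means the rerun still finishes faster than a cold start would have.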
Bland AI's Edge Delivery
Bland processes audio at the nearest data center to the caller. Audio doesn't travel to a central server, get processed, and travel back. It hits an edge node 50ms away, gets processed locally, and returns.
Their infrastructure: dedicated GPU/CPU clusters at edge locations, custom audio processing (not relying on third-party STT APIs), proprietary orchestration with five-nines uptime.
Retell AI's Training Loop
Retell uses their 4.2 million real calls as training data. Their VAD, turn detection, and intent classification models are continuously fine-tuned on production data. 92% first-utterance intent accuracy — meaning the system correctly identifies what the caller wants from their very first sentence, 92% of the time.
This is a data flywheel. More calls -> better models -> better experience -> more calls.
The Real Stack: Indian Healthcare Voice Agent
If you're building a voice agent for Indian healthcare (IVF clinics, hospital chains, diagnostic centers), here's what the production stack looks like:
TELEPHONY
Exotel (Indian SIP trunking, local numbers)
OR Twilio (global, more expensive in India)
TRANSPORT
Daily.co WebRTC (via Pipecat)
OR LiveKit (if self-hosting for data residency)
VAD
Silero VAD (base)
+ custom threshold tuning for Hindi/Gujarati pause patterns
+ noise gate for clinic environments (AC hum, waiting room)
STT
Deepgram Nova-3
- $0.0077/min
- Keyterm prompting: "IVF", "beta-HCG", "embryo transfer",
"follicle", "estradiol", "trigger shot"
- Nova-3 Medical variant for clinical terminology
- Streaming mode (interim transcripts while the caller is still speaking)
LLM
GPT-4o-mini (routine calls: scheduling, reminders, intake)
GPT-4o or Claude Sonnet (complex triage: symptom assessment, escalation decisions)
- Dynamic model selection based on conversation state
- Tool calling for: appointment booking, record lookup, SMS, escalation
TTS
ElevenLabs Turbo v2.5
- <75ms time to first audio
- Hindi voice (female, warm, clinical-professional)
- 32 language support for multilingual patients
STATE
Custom state machine on Pipecat
- States: GREETING -> ID_VERIFICATION -> INTENT -> [SCHEDULING|LAB_RESULTS|MEDICATION|GENERAL] -> CONFIRMATION -> FAREWELL
- Each state: own system prompt, own tools, own escalation rules
- Session object in Redis: patient name, appointment, allergies, mood, call summary
CACHE
Redis for TTS audio caching
- Pre-synthesized: common greetings, confirmations, instructions
- Dynamic cache: frequently repeated sentences get cached after 3 occurrences
- Cache hit rate target: 60-80% of TTS output
BACKEND
Webhooks to clinic EMR/CRM after each call
- Call summary (LLM-generated)
- Appointment changes
- Escalation reports
- Patient sentiment score
COMPLIANCE
- Call recording with consent announcement
- No clinical advice from AI (hard rule in system prompts)
- All medical data stays in Indian data centers
- Audit log of every AI decision and tool call
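The STATE section's flow can be sketched as a transition table plus per-state prompts. The state names come from the stack above; the prompt strings and class shape are illustrative:

```python
# Transition table for the conversation flow in the STATE section above.
TRANSITIONS = {
    "GREETING":        ["ID_VERIFICATION"],
    "ID_VERIFICATION": ["INTENT"],
    "INTENT":          ["SCHEDULING", "LAB_RESULTS", "MEDICATION", "GENERAL"],
    "SCHEDULING":      ["CONFIRMATION"],
    "LAB_RESULTS":     ["CONFIRMATION"],
    "MEDICATION":      ["CONFIRMATION"],
    "GENERAL":         ["CONFIRMATION"],
    "CONFIRMATION":    ["FAREWELL"],
    "FAREWELL":        [],
}

# Each state gets its own system prompt (plus, in production, its own
# tools and escalation rules). These strings are placeholders.
PROMPTS = {
    "GREETING": "Greet the caller warmly. Announce call recording.",
    "ID_VERIFICATION": "Verify identity before sharing any patient data.",
    "INTENT": "Identify what the caller needs. Never give clinical advice.",
}

class CallSession:
    """Minimal session object; in production this lives in Redis."""

    def __init__(self):
        self.state = "GREETING"

    def advance(self, next_state: str) -> str:
        if next_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition: {self.state} -> {next_state}")
        self.state = next_state
        # Swap in the new state's system prompt for the LLM.
        return PROMPTS.get(next_state, "")
```

Because the LLM only ever sees the current state's prompt and tools, a caller in SCHEDULING cannot trigger lab-result disclosure, and every transition is a single auditable event.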
The Critical Insight Nobody Mentions
This is an architecture decision that most blog posts and tutorials get wrong. They demo with WebRTC (high-quality audio) and conclude that speech-to-speech is superior. Then they deploy on phone lines and wonder why it sounds worse.
Build for how the call actually arrives. Not for the demo.
If your users call from a browser or an app (WebRTC), speech-to-speech wins on latency and emotional intelligence. If your users dial a phone number (PSTN), modular wins on accuracy and reliability.
IVF patients in India dial phone numbers. The architecture follows.
Summary: Choose Your Complexity
| Level | Stack | Build Time | Best For |
|---|---|---|---|
| Demo | Pipecat + Deepgram + GPT-4o + ElevenLabs | 1 weekend | Proving the concept |
| Production | Pipecat + custom VAD + state machine + TTS cache | 2-4 months | Single vertical (1 clinic chain) |
| Platform | LiveKit + custom orchestration + data flywheel | 6-12 months | Multi-tenant (serving many clinics) |
| Infrastructure | Full custom | 12+ months | You ARE the platform (Vapi/Retell/Bland) |
Most teams should start at level 2. Build the demo in a weekend. Spend 2-4 months on the customization that makes it production-grade for your specific domain.
The demo is not the hard part. The months of tuning VAD for Hindi speakers, building state machines for IVF patient journeys, and caching 80% of TTS output — that's the product.
This is post 3 of a 4-part deep dive on voice AI engineering. Next: why voice AI still breaks in production — 7 failure modes and how to fix each one.