Pipecat, LiveKit, or Custom — How Voice AI Gets Built

2025-12-23 · 11 min read · ai · voice · engineering · frameworks

Three real options exist for building voice AI. Each makes different tradeoffs. Pipecat's frame pipeline, LiveKit's infrastructure model, and what a production Indian healthcare stack actually looks like.


Two posts ago: the pipeline (STT -> LLM -> TTS). Last post: the orchestration layer (VAD, turn-taking, barge-in, state management).

Now: how this actually gets built. What framework you use. What architecture decisions you make before writing a single line of business logic.

Three real options exist. Each makes different tradeoffs. Choosing wrong costs you months.

Option 1: Pipecat — The Open-Source Standard

Pipecat is an open-source Python framework for building real-time voice AI pipelines. Built by Daily.co, a company that provides real-time video and audio infrastructure.

It's what most voice AI startups build on. It's what companies customizing voice AI for specific verticals (healthcare, finance, support) start with. If you're building a voice agent today and you don't have a reason to build custom, you're probably using Pipecat.

The Core Abstraction: Everything Is a Frame

Pipecat models real-time audio processing as a pipeline of frames. A frame is a small chunk of data — audio, text, or a control signal — that flows through a series of processors.

# Simplified Pipecat pipeline
pipeline = Pipeline([
    transport.input(),           # Raw audio frames from the call
    vad_processor,               # Silero VAD: classifies speech/silence
    stt_processor,               # Deepgram: converts audio to text frames
    context_aggregator,          # Collects text into conversation context
    llm_processor,               # GPT-4o: generates response text frames
    tts_processor,               # ElevenLabs: converts text to audio frames
    transport.output()           # Sends audio frames back to the caller
])

Audio arrives as frames — small chunks, roughly 20ms each. Each processor in the pipeline consumes frames, does its work, and emits new frames. The processors run concurrently. When the STT processor emits a text frame, the LLM processor starts working on it immediately, even if STT is still processing the next audio chunk.

This is the abstraction that makes voice AI buildable by small teams. Instead of managing raw WebSocket connections, audio buffers, sample rates, and byte streams, you chain processors and let frames flow.
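
A toy model of that concurrency, using plain asyncio queues — illustrative only, not Pipecat's actual internals. Each stage consumes from one queue and emits into the next, so the LLM stage starts working on the first text frame while the STT stage is still processing later audio:

```python
import asyncio

async def stt_stage(audio_in: asyncio.Queue, text_out: asyncio.Queue):
    # Consume audio frames, emit a text frame as soon as each is "transcribed".
    while (chunk := await audio_in.get()) is not None:
        await text_out.put(f"text({chunk})")
    await text_out.put(None)  # propagate end-of-stream downstream

async def llm_stage(text_in: asyncio.Queue, reply_out: asyncio.Queue):
    # Starts on each text frame immediately, while STT keeps running upstream.
    while (text := await text_in.get()) is not None:
        await reply_out.put(f"reply({text})")
    await reply_out.put(None)

async def run_pipeline(audio_chunks):
    q1, q2, q3 = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    for chunk in audio_chunks:
        q1.put_nowait(chunk)
    q1.put_nowait(None)  # end-of-stream marker
    # Both stages run concurrently; frames flow through as they arrive.
    await asyncio.gather(stt_stage(q1, q2), llm_stage(q2, q3))
    replies = []
    while (r := await q3.get()) is not None:
        replies.append(r)
    return replies

replies = asyncio.run(run_pipeline(["a0", "a1"]))
```

Real frame pipelines add backpressure, priority frames, and bidirectional flow, but the queue-per-edge shape is the core idea.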

Frame Types

Not all frames are audio. Pipecat has a type system for frames:

  • AudioRawFrame — raw audio data (PCM samples)
  • TranscriptionFrame — partial or final transcript from STT
  • TextFrame — text chunks from LLM response
  • TTSAudioRawFrame — synthesized audio from TTS
  • UserStartedSpeakingFrame — control signal from VAD
  • UserStoppedSpeakingFrame — control signal from VAD
  • LLMFullResponseStartFrame — marks the beginning of LLM output
  • EndFrame — signals pipeline shutdown
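
To make the dispatch concrete, here's a sketch using simplified stand-in dataclasses (hypothetical, not Pipecat's real frame imports) showing how a processor typically branches on frame type and passes through everything it doesn't care about:

```python
from dataclasses import dataclass

# Simplified stand-ins for Pipecat's frame types — illustrative only.
@dataclass
class Frame: ...

@dataclass
class AudioRawFrame(Frame):
    pcm: bytes

@dataclass
class TranscriptionFrame(Frame):
    text: str
    is_final: bool

@dataclass
class UserStoppedSpeakingFrame(Frame): ...

def handle(frame: Frame) -> str:
    # A processor acts on the frame types it owns and forwards the rest.
    if isinstance(frame, TranscriptionFrame) and frame.is_final:
        return f"send to LLM: {frame.text}"
    if isinstance(frame, UserStoppedSpeakingFrame):
        return "start endpoint timer"
    return "pass through"

r1 = handle(TranscriptionFrame(text="hello", is_final=True))
r2 = handle(UserStoppedSpeakingFrame())
r3 = handle(AudioRawFrame(pcm=b"\x00\x01"))
```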

Transport Layer

Pipecat doesn't handle telephony itself. It delegates to a transport — the layer that manages the actual audio connection.

The default transport is Daily.co's WebRTC infrastructure. Your voice agent runs on a server. The caller connects through a Daily.co room. Audio streams bidirectionally over WebRTC.

For phone calls (PSTN), you need a SIP trunk. Pipecat integrates with:

  • Twilio — most common, easiest setup
  • Telnyx — private network, better quality, cheaper at scale
  • Daily.co SIP — native integration, simplest with Pipecat

The SIP trunk bridges the phone network and WebRTC. Caller dials a regular number -> SIP trunk routes to Daily.co -> Daily.co streams to your Pipecat pipeline.

What Pipecat Does NOT Give You

Pipecat gives you plumbing. It does NOT give you:

1. Regional language VAD tuning. Silero VAD is the default. It works well on clean English. It does not handle Hindi pause patterns, Gujarati filler sounds, noisy Indian phone audio, or elderly callers with longer pauses. You must tune thresholds, retrain models, or build a secondary classifier.

2. Backchanneling / filler injection. There's no built-in "play 'hmm' while the LLM is thinking" processor. You build a custom processor that:

  • Detects when UserStoppedSpeakingFrame was emitted
  • Starts a timer
  • If LLM hasn't emitted its first token within 300ms, plays a cached filler sound
  • Stops the filler when LLM audio begins

This is maybe 100 lines of code. But it's 100 lines that reduce perceived latency by 50%.

3. State machines for conversation flow. Pipecat's LLM processor sends the full conversation history on every turn. For a long call, that's expensive and eventually exceeds context limits. You need to build:

  • Entity extraction after every few turns
  • External state storage
  • Dynamic system prompt that changes based on conversation phase
  • Token budget management

4. TTS caching. In many domains, 70-80% of sentences repeat across calls. "Your appointment is confirmed for..." "Please arrive 15 minutes before..." Instead of synthesizing these every time, you cache the audio and skip TTS entirely. Sub-100ms response for cached phrases.
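
A hedged sketch of that cache — a plain dict stands in for Redis, a fake function stands in for the real TTS API, and the names are illustrative. It implements the "cache after 3 occurrences" rule from above:

```python
import hashlib

class TTSCache:
    """Cache synthesized audio keyed by (voice, normalized sentence)."""
    def __init__(self, synthesize, min_hits_to_cache: int = 3):
        self.synthesize = synthesize   # real system: ElevenLabs API call
        self.store = {}                # real system: Redis with TTL
        self.hits = {}
        self.min_hits = min_hits_to_cache
        self.synth_calls = 0           # for observability

    def _key(self, voice: str, text: str) -> str:
        # Normalize whitespace and case so trivial variants share a key.
        norm = " ".join(text.lower().split())
        return hashlib.sha256(f"{voice}|{norm}".encode()).hexdigest()

    def get_audio(self, voice: str, text: str) -> bytes:
        key = self._key(voice, text)
        if key in self.store:
            return self.store[key]     # cache hit: sub-100ms path, no TTS
        audio = self.synthesize(text)
        self.synth_calls += 1
        # Dynamic caching: only cache once the sentence has repeated enough.
        self.hits[key] = self.hits.get(key, 0) + 1
        if self.hits[key] >= self.min_hits:
            self.store[key] = audio
        return audio

fake_tts = lambda text: f"AUDIO[{text}]".encode()
cache = TTSCache(fake_tts)
for _ in range(5):
    cache.get_audio("hindi_f", "Your appointment is confirmed.")
# First three calls hit the synthesizer; the cached copy serves the rest.
```

In production you'd also pre-warm the cache with known greetings and confirmations at deploy time, so the very first caller gets the fast path.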

What Real Customization Looks Like
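
As one concrete example, here's roughly what the filler-injection processor from item 2 looks like, stripped to its timing logic — an asyncio sketch with made-up names and a fake "play" callback, not production code:

```python
import asyncio

FILLER_DELAY_S = 0.3   # play a filler if the LLM is silent for 300ms

async def filler_injector(llm_first_token: asyncio.Event, play):
    # Wait up to 300ms for the LLM's first token. If it hasn't arrived,
    # play a cached filler, then stop it once LLM audio begins.
    try:
        await asyncio.wait_for(llm_first_token.wait(), timeout=FILLER_DELAY_S)
    except asyncio.TimeoutError:
        play("hmm.wav")               # cached audio, no TTS round trip
        await llm_first_token.wait()
        play("stop_filler")

async def demo(llm_latency_s: float):
    played = []
    first_token = asyncio.Event()

    async def llm():
        await asyncio.sleep(llm_latency_s)  # stand-in for time-to-first-token
        first_token.set()

    await asyncio.gather(llm(), filler_injector(first_token, played.append))
    return played

fast = asyncio.run(demo(0.1))   # LLM beats the 300ms window: no filler
slow = asyncio.run(demo(0.5))   # LLM is slow: filler plays, then stops
```

The real processor also has to cancel the timer on barge-in and pick fillers that match the agent's voice, but the wait-then-fill race above is the heart of it.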

Option 2: LiveKit — The Infrastructure Play

LiveKit takes a fundamentally different approach. Where Pipecat is a framework (you build pipelines), LiveKit is infrastructure (you build on top of rooms).

LiveKit started as an open-source alternative to Twilio and Daily.co for real-time communication. WebRTC infrastructure — rooms, participants, audio/video routing, recording. Then they added an Agents Framework on top: a Python/Node SDK for building AI participants that join rooms and interact with humans.

Architecture

LiveKit Server (rooms, audio routing, WebRTC, recording)
  -> LiveKit Agents Framework (Python/Node SDK)
    -> Your agent logic (STT + LLM + TTS + tools)

The LiveKit server handles all transport. Your agent is just another "participant" in a room. It receives audio from the human participant, processes it, and sends audio back.

What LiveKit Gives You That Pipecat Doesn't

Self-hostable infrastructure. Pipecat depends on Daily.co's cloud for WebRTC transport. LiveKit's server is open-source. You can run it on your own infrastructure. For healthcare companies with data residency requirements (patient audio cannot leave India/US), this matters.

Built-in room management. Multiple participants in a conversation. Record the call. Stream it. Add a human supervisor who can listen in. Add a second AI agent that handles a different language. All native to the platform.

Conversational turn detection. LiveKit's Agents Framework has built-in turn detection that goes beyond VAD. It considers:

  • Audio energy levels
  • Speech duration
  • Semantic signals (did the LLM's response end with a question?)
  • Configurable endpointing thresholds per language
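
A toy combiner over those signals might look like the following — the thresholds and rules are invented for illustration and are not LiveKit's actual algorithm:

```python
def end_of_turn(
    silence_ms: int,
    speech_ms: int,
    agent_asked_question: bool,
    base_threshold_ms: int = 700,   # tune per language in a real system
) -> bool:
    """Toy multi-signal endpointing: silence alone isn't the decision."""
    threshold = base_threshold_ms
    # Semantic signal: after the agent asks a question, short answers
    # ("yes", "Thursday") are expected, so endpoint sooner.
    if agent_asked_question and speech_ms < 1500:
        threshold -= 300
    # Very short utterances are often fillers; wait longer before cutting in.
    if speech_ms < 300:
        threshold += 400
    return silence_ms >= threshold

# A short answer after a question endpoints at ~400ms of silence,
# but the same silence mid-monologue does not end the turn.
q_answer = end_of_turn(silence_ms=450, speech_ms=800, agent_asked_question=True)
mid_talk = end_of_turn(silence_ms=450, speech_ms=800, agent_asked_question=False)
```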

Scale primitives. Distribute agents across GPU clusters. Load balance across regions. LiveKit Cloud (their hosted version) handles this. Self-hosted, you manage it but the primitives exist.

What Pipecat Gives You That LiveKit Doesn't

Simpler mental model. Frames in, frames out. No rooms, no participants, no server management. For a single voice agent answering calls, Pipecat is simpler.

Larger processor ecosystem. More pre-built integrations for STT/LLM/TTS providers. Community-contributed processors for specific use cases.

More community examples. Pipecat has been the default for voice AI builders longer. More tutorials, more example code, more Stack Overflow answers.

When to Choose Which

Scenario                                        Choose
Single voice agent, ship fast                   Pipecat
Need to self-host everything                    LiveKit
10K+ concurrent calls                           LiveKit
Building a platform (not just agents)           LiveKit
Healthcare with data residency                  LiveKit
Prototype / hackathon                           Pipecat
Need recording, streaming, multi-participant    LiveKit

Most startups start with Pipecat. Companies that outgrow it or need infrastructure control migrate to LiveKit or custom.

Option 3: Custom — What Platforms Do at Scale

Bland AI claims 1 million concurrent calls. Retell AI processes 4.2 million real calls for training data. Vapi handles high-frequency trading voice interfaces where microseconds matter.

At these scales, framework overhead matters. Garbage collection pauses matter. Memory allocation patterns matter. Network hop count matters.

These companies build custom orchestration:

Vapi's Predict-and-Scrap

Vapi's proprietary trick: start LLM inference before the user finishes speaking.

User is mid-sentence: "I want to reschedule my appoint--"

System:
  1. STT partial: "I want to reschedule my"
  2. LLM starts generating on partial transcript
  3. LLM begins: "Sure, I can help you reschedule..."

User finishes: "--ment to next Thursday"

System:
  4. STT final: "I want to reschedule my appointment to next Thursday"
  5. SCRAP the in-progress LLM response
  6. RESTART with full transcript
  7. LLM generates with complete context: "I'll reschedule to next Thursday..."

Sometimes the prediction is right (the partial transcript was enough). The LLM response is already generating and TTS is already synthesizing. Response time: near zero.

Sometimes the prediction is wrong. The system scraps and restarts. Wasted compute. But even in the restart case, the LLM is "warmed up" — KV cache is partially populated, first tokens arrive faster.

Net result: 200-300ms latency reduction on average. The compute waste is worth it at their scale.
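
The scrap-and-restart control flow can be sketched with asyncio tasks — a simplified model with stand-in latencies, not Vapi's implementation:

```python
import asyncio

async def llm_generate(transcript: str) -> str:
    await asyncio.sleep(0.2)            # stand-in for model latency
    return f"response to: {transcript}"

async def predict_and_scrap(partial: str, final: str) -> str:
    # 1. Start generating on the partial transcript before the user finishes.
    speculative = asyncio.create_task(llm_generate(partial))
    await asyncio.sleep(0.1)            # ...user finishes speaking here
    if final == partial:
        # Prediction was right: the response is already in flight.
        return await speculative
    # 2. Prediction was wrong: scrap the in-progress response, restart
    #    with the full transcript (wasted compute, but warmed caches).
    speculative.cancel()
    return await llm_generate(final)

result = asyncio.run(predict_and_scrap(
    "I want to reschedule my",
    "I want to reschedule my appointment to next Thursday",
))
```

A production version would compare transcripts more loosely (prefix match, semantic similarity) instead of exact equality, so a trailing "um" doesn't force a scrap.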

Bland AI's Edge Delivery

Bland processes audio at the nearest data center to the caller. Audio doesn't travel to a central server, get processed, and travel back. It hits an edge node 50ms away, gets processed locally, and returns.

Their infrastructure: dedicated GPU/CPU clusters at edge locations, custom audio processing (not relying on third-party STT APIs), proprietary orchestration with five-nines uptime.

Retell AI's Training Loop

Retell uses their 4.2 million real calls as training data. Their VAD, turn detection, and intent classification models are continuously fine-tuned on production data. 92% first-utterance intent accuracy — meaning the system correctly identifies what the caller wants from their very first sentence, 92% of the time.

This is a data flywheel. More calls -> better models -> better experience -> more calls.

The Real Stack: Indian Healthcare Voice Agent

If you're building a voice agent for Indian healthcare (IVF clinics, hospital chains, diagnostic centers), here's what the production stack looks like:

TELEPHONY
  Exotel (Indian SIP trunking, local numbers)
  OR Twilio (global, more expensive in India)

TRANSPORT
  Daily.co WebRTC (via Pipecat)
  OR LiveKit (if self-hosting for data residency)

VAD
  Silero VAD (base)
  + custom threshold tuning for Hindi/Gujarati pause patterns
  + noise gate for clinic environments (AC hum, waiting room)

STT
  Deepgram Nova-3
  - $0.0077/min
  - Keyterm prompting: "IVF", "beta-HCG", "embryo transfer",
    "follicle", "estradiol", "trigger shot"
  - Nova-3 Medical variant for clinical terminology
  - Streaming mode (partial transcripts every 20ms)

LLM
  GPT-4o-mini (routine calls: scheduling, reminders, intake)
  GPT-4o or Claude Sonnet (complex triage: symptom assessment, escalation decisions)
  - Dynamic model selection based on conversation state
  - Tool calling for: appointment booking, record lookup, SMS, escalation

TTS
  ElevenLabs Turbo v2.5
  - <75ms time to first audio
  - Hindi voice (female, warm, clinical-professional)
  - 32 language support for multilingual patients

STATE
  Custom state machine on Pipecat
  - States: GREETING -> ID_VERIFICATION -> INTENT -> [SCHEDULING|LAB_RESULTS|MEDICATION|GENERAL] -> CONFIRMATION -> FAREWELL
  - Each state: own system prompt, own tools, own escalation rules
  - Session object in Redis: patient name, appointment, allergies, mood, call summary

CACHE
  Redis for TTS audio caching
  - Pre-synthesized: common greetings, confirmations, instructions
  - Dynamic cache: frequently repeated sentences get cached after 3 occurrences
  - Cache hit rate target: 60-80% of TTS output

BACKEND
  Webhooks to clinic EMR/CRM after each call
  - Call summary (LLM-generated)
  - Appointment changes
  - Escalation reports
  - Patient sentiment score

COMPLIANCE
  - Call recording with consent announcement
  - No clinical advice from AI (hard rule in system prompts)
  - All medical data stays in Indian data centers
  - Audit log of every AI decision and tool call
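
The state machine at the heart of that stack can be sketched as a table of states, each carrying its own prompt and tools. The names mirror the STATE section above, but the code and tool names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class State:
    prompt: str                                   # per-state system prompt
    tools: list = field(default_factory=list)     # tools exposed in this state
    next_states: list = field(default_factory=list)

# Hypothetical state table (a subset of the full flow above).
STATES = {
    "GREETING":        State("Greet warmly in the caller's language.",
                             next_states=["ID_VERIFICATION"]),
    "ID_VERIFICATION": State("Verify name and date of birth first.",
                             tools=["record_lookup"],
                             next_states=["INTENT"]),
    "INTENT":          State("Classify what the caller needs.",
                             next_states=["SCHEDULING", "LAB_RESULTS",
                                          "MEDICATION", "GENERAL"]),
    "SCHEDULING":      State("Book or change appointments only.",
                             tools=["appointment_booking", "send_sms"],
                             next_states=["CONFIRMATION"]),
    "CONFIRMATION":    State("Read back the outcome and confirm.",
                             next_states=["FAREWELL"]),
    "FAREWELL":        State("Close the call politely.", next_states=[]),
}

def transition(current: str, target: str) -> str:
    # Only legal transitions are allowed; anything else escalates to a human.
    if target in STATES[current].next_states:
        return target
    return "ESCALATE_TO_HUMAN"

ok = transition("GREETING", "ID_VERIFICATION")
bad = transition("GREETING", "SCHEDULING")   # illegal jump: escalate
```

Because each state carries its own prompt and tool set, the LLM never sees tools it shouldn't call in that phase — which is also what keeps the token budget bounded on long calls.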

The Critical Insight Nobody Mentions

The choice of transport is the architecture decision most blog posts and tutorials get wrong. They demo over WebRTC (high-quality audio) and conclude that speech-to-speech is superior. Then they deploy on phone lines and wonder why it sounds worse.

Build for how the call actually arrives. Not for the demo.

If your users call from a browser or an app (WebRTC), speech-to-speech wins on latency and emotional intelligence. If your users dial a phone number (PSTN), modular wins on accuracy and reliability.

IVF patients in India dial phone numbers. The architecture follows.

Summary: Choose Your Complexity

Level           Stack                                              Build Time   Best For
Demo            Pipecat + Deepgram + GPT-4o + ElevenLabs           1 weekend    Proving the concept
Production      Pipecat + custom VAD + state machine + TTS cache   2-4 months   Single vertical (1 clinic chain)
Platform        LiveKit + custom orchestration + data flywheel     6-12 months  Multi-tenant (serving many clinics)
Infrastructure  Full custom                                        12+ months   You ARE the platform (Vapi/Retell/Bland)

Most teams should start at level 2. Build the demo in a weekend. Spend 2-4 months on the customization that makes it production-grade for your specific domain.

The demo is not the hard part. The six months of tuning VAD for Hindi speakers, building state machines for IVF patient journeys, and caching 80% of TTS output — that's the product.

This is post 3 of a 4-part deep dive on voice AI engineering. Next: why voice AI still breaks in production — 7 failure modes and how to fix each one.