Sora, Kling, Veo, Runway — When to Use Which

2026-02-25 · 7 min read · ai, video-generation, engineering, models

Five video generation models, five different strengths. Practical decisions — which model for which job, how to build a production pipeline, and the real costs per clip.


Last post: how video generation works. DiT architecture, temporal attention, latent video compression, stitching.

This post: practical decisions. Which model for which job. How to build a production pipeline. Real costs.

The Models Side by Side

              Sora                Kling             Veo             Runway Gen-3      Wan
Maker         OpenAI              Kuaishou          Google          Runway            Alibaba
Duration      5-20s               5-10s             5-8s            5-10s             5s
Resolution    1080p               1080p             1080p-4K        1080p             720p-1080p
Best at       Overall quality     Motion, lip sync  Cinematic look  Creative control  Self-hosting
Worst at      Availability, cost  Long scenes       Long content    Photorealism      Quality ceiling
API           OpenAI API          fal.ai, direct    Vertex AI       Runway API        Self-host
Open weights  No                  No                No              No                Yes
Lip sync      No                  Yes (Avatar)      No              No                No
Cost/5s clip  ~$0.50              $0.10-0.50        $0.10-0.30      $0.25-0.50        GPU cost
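
If you just need a default choice, the "Best at" row collapses into a small selection rule. The sketch below restates the table in Python; it is not an official decision tree, just the comparison made executable:

# The comparison table's "Best at" column as a selection rule.
# Purely illustrative: this restates the table above, nothing more.
def pick_model(
    need_lip_sync: bool = False,
    need_self_hosting: bool = False,
    need_camera_control: bool = False,
    need_cinematic_look: bool = False,
) -> str:
    if need_lip_sync:
        return "Kling"         # only model here with an Avatar lip-sync mode
    if need_self_hosting:
        return "Wan"           # the only open-weights option
    if need_camera_control:
        return "Runway Gen-3"  # Motion Brush, camera controls
    if need_cinematic_look:
        return "Veo"
    return "Sora"              # default to the overall quality benchmark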

When to Use Which

Sora: The Quality Benchmark

Use when: You need the highest overall quality and don't mind paying for it. Marketing videos, product demos, social media content where quality is the priority.

Don't use when: You need lip sync, avatar generation, or programmatic access at scale. Sora is best as a creative tool, not a production pipeline component.

Technical notes: Sora uses a DiT architecture trained on video and image data jointly. It processes variable-length sequences of spacetime patches, allowing it to generate different resolutions and aspect ratios natively. OpenAI claims it models "the physical world" — in practice, it approximates physics well for common scenarios (water, cloth, walking) but fails on uncommon ones (complex mechanical systems, unusual physics).

Kling: The Motion and Lip Sync Specialist

Use when: Motion quality matters. Character movement, dancing, walking, facial expressions. Also when you need lip sync (avatar generation from image + audio).

Don't use when: You need very long clips (stitching quality degrades faster than Sora). Or when you need Western-specific cultural references (training data is Chinese video-centric).

Technical notes: Kling's motion quality comes from being trained on Kuaishou's massive short-video platform data. Billions of short videos of real people doing real things. This gives it significantly better motion priors than models trained on curated film/stock footage.

The Avatar lip-sync feature is a separate model: input image + audio file, output talking head video. Quality is production-grade for marketing content. Not yet at "deepfake indistinguishable from real" level, but close.

Veo: The Cinematic Engine

Use when: You want filmic, cinematic quality with a short-film aesthetic, or you need Google ecosystem integration.

Don't use when: You need lip sync (not available). Or when you need very long content (same 5-8 second limit as others).

Technical notes: Veo 2 is the current production version. Veo 3 (in preview) adds native audio generation — the model produces synchronized sound effects with the video. This is a significant differentiator. Instead of generating video and then adding sound separately, Veo 3 produces both in one pass.

Google's advantage: they can train on YouTube's entire video corpus (with appropriate licensing). This gives Veo a breadth of training data that other models can't match.

Runway Gen-3: The Creative Control Tool

Use when: You need precise control over the generation. Camera movements, motion regions, style references. Creative professionals who need predictability over surprise.

Don't use when: You need API-scale production (pricing is steep for volume). Or when photorealism is the priority (Runway skews more artistic/stylized).

Technical notes: Runway's differentiator is the tooling, not just the model. Motion Brush lets you paint motion onto specific regions of the image. Camera controls let you specify dolly, pan, crane shots. Gen-3 Alpha Turbo is a faster variant for iteration.

Runway co-authored the original latent diffusion research that Stable Diffusion was built on. Their team has deep diffusion-model expertise.

Wan: The Open-Source Option

Use when: You need to self-host. You want to fine-tune a video model on your own data. You have GPU infrastructure. Cost at scale matters more than peak quality.

Don't use when: You need highest quality out of the box. Or when you don't have GPU infrastructure for inference.

Technical notes: Wan is from Alibaba's research lab. Open weights on Hugging Face. Multiple model sizes. The community is building tools around it (ComfyUI nodes, fine-tuning scripts). This is the Stable Diffusion of video — not the best quality, but the most accessible and customizable.

Building a Production Video Pipeline

A production pipeline isn't "call one API and get a video." It's an orchestrated system with multiple models, quality checks, and human curation.

Architecture

SCRIPT
  Text script -> Claude/GPT-4o (break into scenes)
  Each scene: description, duration, camera angle, mood

SCENE GENERATION
  For each scene:
    1. Generate static keyframe (Flux Pro via fal.ai)
    2. Generate video from keyframe (Kling image-to-video)
    3. Generate 2-3 candidates per scene
    4. Quality filter: discard clips with artifacts

ASSEMBLY
  ffmpeg: trim, crossfade between clips, add transitions
  Audio: background music, sound effects, voiceover
  Final: color grade, titles, export

COST PER MINUTE (12 five-second clips per finished minute)
  Keyframes: 12 x $0.05 = $0.60
  Video clips: 12 x 3 candidates x $0.30 = $10.80
  Audio: $1-2 (TTS or licensed music)
  Total: ~$12-14 per minute of finished video
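
A minimal sketch of the scene-generation step, assuming fal.ai's Python client (fal-client). The model IDs and response fields below follow fal.ai's naming conventions but are assumptions; check the current model pages for the exact schemas:

import fal_client

def generate_scene(description: str, n_candidates: int = 3) -> list[str]:
    # 1. Static keyframe from the scene description (Flux Pro via fal.ai).
    keyframe = fal_client.subscribe(
        "fal-ai/flux-pro",  # assumed model ID
        arguments={"prompt": description},
    )
    image_url = keyframe["images"][0]["url"]  # assumed response shape

    # 2-3. Animate the keyframe (Kling image-to-video), several candidates
    # so the quality filter has something to pick from.
    candidates = []
    for _ in range(n_candidates):
        clip = fal_client.subscribe(
            "fal-ai/kling-video/v1/standard/image-to-video",  # assumed model ID
            arguments={"prompt": description, "image_url": image_url},
        )
        candidates.append(clip["video"]["url"])  # assumed response shape
    return candidates

Assembly after that is plain ffmpeg; the xfade filter handles the crossfades between consecutive clips.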

Quality Filtering

Not all generated clips are usable. Common artifacts (a cheap automated check for one of them follows the list):

  • Morphing: Object shapes change mid-clip
  • Teleportation: Objects jump position between frames
  • Flickering: Rapid brightness changes
  • Ghost limbs: Extra arms, fingers, or body parts appearing briefly
  • Text corruption: Any text in the scene becomes garbled
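
Most of these still need a human eye, but flicker is cheap to catch automatically. A minimal sketch with OpenCV: track mean frame brightness and flag clips with abrupt jumps. The threshold is a made-up starting point, not a calibrated value; tune it on your own rejected clips.

import cv2
import numpy as np

def has_flicker(path: str, max_luma_jump: float = 20.0) -> bool:
    # Mean luma per frame; a large frame-to-frame jump suggests flicker.
    cap = cv2.VideoCapture(path)
    lumas = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        lumas.append(float(gray.mean()))
    cap.release()
    jumps = np.abs(np.diff(lumas))
    return bool((jumps > max_luma_jump).any())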

Avatar-Specific Pipeline

For talking-head videos (the most common business use case):

1. CHARACTER
   Reference photo OR Flux-generated character
   Consistent face across all clips (via seed + same reference image)

2. VOICE
   Clone a voice: MiniMax, ElevenLabs
   Generate speech from script: TTS API
   Output: audio file per scene

3. LIP SYNC
   Input: character image + audio
   Model: Kling Avatar (fal.ai)
   Output: talking head video

4. POST-PRODUCTION
   ffmpeg: stitch clips, add captions (whisper transcription)
   PIL/ImageMagick: branded lower-thirds, intro/outro cards
   Background: AI-generated or stock

5. OUTPUT
   1080p MP4, 16:9 (YouTube) or 9:16 (Reels/Shorts)

This pipeline produces videos that look like a real person (or a convincing character) delivering a presentation. At $1-2 per clip, one person can realistically produce 10 videos a day.
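
For the post-production step (step 4), the glue is mundane: concatenate clips with ffmpeg, transcribe with Whisper for captions. A sketch assuming ffmpeg on PATH and the open-source openai-whisper package; paths and the model size are illustrative:

import subprocess
import whisper

def stitch(clips: list[str], out: str = "final.mp4") -> None:
    # ffmpeg's concat demuxer wants a text file of "file 'path'" lines.
    with open("clips.txt", "w") as f:
        f.writelines(f"file '{c}'\n" for c in clips)
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", "clips.txt", "-c", "copy", out],
        check=True,
    )

def caption_segments(video: str) -> list[tuple[float, float, str]]:
    # Whisper reads the audio track directly (it shells out to ffmpeg)
    # and returns timed segments, ready to render as SRT captions.
    model = whisper.load_model("base")
    result = model.transcribe(video)
    return [(s["start"], s["end"], s["text"]) for s in result["segments"]]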

The Emerging Capabilities

Audio Generation

Veo 3 generates synchronized audio. This eliminates the separate sound design step for many use cases. The model understands that a bouncing ball should produce impact sounds, that water should splash, that footsteps should match walking pace.

This is not just convenience. It changes the cost equation. Sound design is traditionally expensive and time-consuming. Automated sound generation could reduce video production costs by 20-30%.

Camera Control

Runway leads here, but others are following. Specifying camera movements:

  • Dolly in/out (move camera closer/farther)
  • Pan left/right (rotate camera horizontally)
  • Crane up/down (raise/lower camera)
  • Orbit (circle around subject)

This gives directors/creators precise control over the visual language. A dolly-in creates intimacy. A crane shot creates grandeur. These are not just technical movements — they're storytelling tools.
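
In API terms, these controls usually surface as structured parameters next to the text prompt. A hypothetical request shape (the field names are illustrative, not Runway's actual schema; their API docs define the real fields):

# Hypothetical payload for a camera-controlled generation request.
# Every field name here is illustrative, not Runway's actual schema.
request = {
    "prompt": "a lighthouse at dusk, waves breaking below",
    "camera": {
        "movement": "dolly_in",  # or "pan_left", "crane_up", "orbit"
        "speed": 0.3,            # assumed convention: normalized 0.0-1.0
    },
    "duration_seconds": 5,
}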

Consistency Across Clips

The biggest pain point in multi-clip videos: the character looks slightly different in each clip. Hair changes. Clothing shifts. Face morphs. The working mitigation is the one the avatar pipeline above relies on: drive every clip from the same reference image with a fixed seed, rather than prompting each clip from scratch. That reduces drift; it doesn't eliminate it.

The Cost Trajectory

Year    Cost for 1-min talking head video      Quality level
2023    $5,000-10,000 (production crew)        Professional
2024    $500-1,000 (AI + heavy editing)        Acceptable
2025    $34-50 (AI pipeline, light editing)    Good
2026    $10-15 (automated pipeline)            Very good

This completes the video generation block. Next block: AI Agents — the loop, the tools, and the memory that turns a chatbot into something that actually does things.