
From Static Images to Moving Pictures — How Video AI Actually Works
Image generation creates a single moment. Video generation creates time. A 5-second video is 120 frames that must be spatially AND temporally coherent. Here's why video is 100x harder than images.
Image generation creates a single moment. Video generation creates time.
That sounds simple. It's not. A 5-second video at 24 fps is 120 images that must be spatially coherent (each frame looks right) AND temporally coherent (the transition between frames looks right). Getting one frame right is the image generation problem. Getting 120 frames to flow naturally into each other is a different problem entirely.
Why Video Is 100x Harder Than Images
The Temporal Coherence Problem
Generate 120 images from the same prompt. Each image is individually good. Play them in sequence. It looks like a hallucinating fever dream — objects appear, disappear, change shape, teleport.
A ball rolling across a table needs to:
- Maintain its shape across all 120 frames
- Follow a physically plausible trajectory
- Cast a shadow that moves with it
- Not duplicate or vanish
- Not clip through the table
None of this is guaranteed by generating independent images. Temporal coherence — consistency across time — must be explicitly modeled.
The Physics Problem
Video is not just moving pictures. It's physics. When a cup falls off a table:
- It accelerates downward (gravity)
- It rotates (angular momentum)
- Liquid inside follows fluid dynamics
- It shatters on contact (material properties)
- Pieces scatter according to mass and velocity
Image generation doesn't model physics. It generates plausible-looking static scenes. Video generation must produce frame sequences where objects obey (or at least approximate) physical laws. Without this, every motion looks "off" — the uncanny valley of motion.
The Compute Problem
A video latent carries far more tokens than an image latent (120 frames instead of one), and every denoising step must run attention over all of them. This is why video generation is slow and expensive: a 5-second clip might take 2-5 minutes on an A100, and a 60-second clip can be impractical on current hardware.
The Architecture: DiT for Video
Modern video generation models use the Diffusion Transformer (DiT) architecture adapted for video.
Video as a 3D Latent
An image latent is 2D: height x width. A video latent adds a third dimension: time.
Image latent: H x W x C (height x width x channels)
Video latent: T x H x W x C (time x height x width x channels)
The VAE (autoencoder) compresses each frame individually, then stacks them along the time axis. A 120-frame video at 1024x1024 becomes a 120x128x128x4 latent tensor.
Some models use a temporal VAE that also compresses along the time axis — reducing 120 frames to 30 temporal "keyframes" in latent space. This further reduces compute.
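The shape bookkeeping above can be sketched in a few lines of NumPy. The 8x spatial downsampling, 4 latent channels, and 4x temporal compression match the numbers in the text; they are illustrative, not tied to any specific model:

```python
import numpy as np

frames, height, width = 120, 1024, 1024
down = 8             # assumed spatial downsampling factor of the VAE
latent_channels = 4  # assumed latent channel count

# Per-frame VAE encoding, stacked along the time axis: T x h x w x C
video_latent = np.zeros((frames, height // down, width // down, latent_channels))
print(video_latent.shape)  # (120, 128, 128, 4)

# A temporal VAE also compresses time, e.g. 4x: 120 frames -> 30 "keyframes"
temporal_latent = np.zeros((frames // 4, height // down, width // down, latent_channels))
print(temporal_latent.shape)  # (30, 128, 128, 4)
```

The time axis is what separates this tensor from an image latent; everything downstream (attention, denoising) operates on it.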
Temporal Attention
The key architectural addition for video: temporal attention layers.
In image DiT, each patch token attends to all other patch tokens within the same image. This captures spatial relationships (the ball is next to the cup).
In video DiT, attention happens across both space AND time:
- Spatial attention: Within each frame, patches attend to other patches, just as in image generation. "What is next to what?"
- Temporal attention: Across frames, each patch attends to the same spatial position in other frames. "How does this patch change over time?" A patch containing the ball in frame 30 attends to the same patch in frames 1-120, learning the ball's trajectory.
- Cross-attention: Text embedding guides both spatial and temporal generation. The prompt "a ball rolling left" conditions the temporal attention to produce leftward motion.
These three attention mechanisms interleave across many layers. The model simultaneously learns what each frame should look like (spatial), how frames relate to each other (temporal), and how both respond to the text prompt (cross-attention).
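A common way to implement the spatial/temporal split is factorized attention: the same token tensor is attended over its spatial axis, then transposed and attended over its temporal axis. A toy NumPy sketch (the sizes and the single-head, no-projection attention are illustrative, not any model's actual layer):

```python
import numpy as np

# Hypothetical sizes: T frames, S patch tokens per frame, D channels.
T, S, D = 8, 16, 32
rng = np.random.default_rng(0)
tokens = rng.standard_normal((T, S, D))

def attend(x):
    # Toy self-attention over the second-to-last (sequence) axis.
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(D)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ x

# Spatial attention: each frame's S patches attend to each other
# (batched over the T frames).
spatial_out = attend(tokens)

# Temporal attention: transpose so each spatial position's T copies
# attend to each other across time, then transpose back.
temporal_out = attend(tokens.swapaxes(0, 1)).swapaxes(0, 1)

print(spatial_out.shape, temporal_out.shape)  # (8, 16, 32) (8, 16, 32)
```

The transpose is the whole trick: the same attention code runs over space or time depending on which axis is treated as the sequence. Full spatio-temporal attention (every patch attends to every patch in every frame) is also possible but far more expensive.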
The Generation Process
1. Encode text prompt via T5/CLIP -> text embedding
2. Initialize random noise: T x H x W x C (video latent)
3. Denoise over 50-100 steps:
a. Spatial attention (within each frame)
b. Temporal attention (across frames)
c. Cross-attention (with text)
d. Remove predicted noise
4. Decode each frame latent back to pixels via VAE
5. Assemble frames into video
The denoising process shapes both appearance (spatial) and motion (temporal) simultaneously. Early steps establish global motion patterns (ball moves left). Middle steps add detail (ball has stripes). Late steps refine (lighting changes as ball moves, shadow follows).
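The five steps above can be sketched as a loop. Every function below is a stub standing in for the real model components (text encoder, DiT, VAE); shapes and step counts are illustrative:

```python
import numpy as np

T, H, W, C = 16, 32, 32, 4   # toy video-latent dimensions
steps = 50
rng = np.random.default_rng(0)

text_embedding = rng.standard_normal(512)   # step 1: stand-in for T5/CLIP output
latent = rng.standard_normal((T, H, W, C))  # step 2: pure noise

def predict_noise(latent, text_embedding, t):
    # Stub for the DiT forward pass: the spatial, temporal, and
    # cross-attention layers described above would run inside this call.
    return 0.02 * latent

for t in reversed(range(steps)):            # step 3: iterative denoising
    latent = latent - predict_noise(latent, text_embedding, t)

frames = [np.tanh(latent[i]) for i in range(T)]  # step 4: stub VAE decode
video = np.stack(frames)                         # step 5: assemble frames
print(video.shape)  # (16, 32, 32, 4)
```

The structure to notice: one latent tensor covering all frames is denoised jointly, which is what lets early steps lock in global motion before later steps add per-frame detail.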
Text-to-Video vs Image-to-Video vs Video-to-Video
Text-to-Video
Input: text prompt. Output: video.
This is the hardest generation task. The model must imagine both appearance and motion from scratch. Results are impressive but limited: typically 2-10 seconds, with degrading quality on longer clips.
Image-to-Video
Input: a reference image + text prompt. Output: video starting from that image.
Much easier. The first frame is given, so the model only needs to imagine motion forward. This produces better quality and longer clips because the model has a strong anchor (the reference frame).
This is what Kling, Runway, and Pika excel at. You provide a static image (possibly AI-generated), and the model animates it. The $34 avatar pipeline we built at ten× uses this: generate a static avatar image with Flux, then animate it with Kling's image-to-video.
Video-to-Video
Input: an existing video + text prompt. Output: transformed video.
The source video provides all temporal structure (motion, timing). The model only transforms the appearance. "Make this video look like a watercolor painting" or "Change the person's outfit to a suit."
This is technically the easiest because motion is already solved. But quality varies — maintaining identity and detail across frames while changing style is still challenging.
The Models
Sora (OpenAI)
The model that started the conversation. Announced February 2024 with jaw-dropping demos.
- Duration: Up to 60 seconds (originally; practical outputs are 5-20 seconds)
- Resolution: Up to 1080p
- Architecture: DiT with spatial-temporal attention
- Innovation: "World simulator" framing. OpenAI positioned Sora not as a video generator but as a model that understands physics and world dynamics. The demos showed realistic water physics, light refraction, and object permanence.
- Reality: Quality is excellent for short clips. Longer videos show coherence degradation. Physics understanding is statistical, not actual simulation — it approximates what physics looks like from training data.
- Access: ChatGPT Plus subscription. API available.
Kling (Kuaishou)
Chinese-made model that surprised the industry with quality rivaling Sora.
- Duration: Up to 10 seconds (extended via stitching)
- Resolution: 1080p
- Architecture: DiT variant with strong motion modeling
- Strength: Motion quality. Characters walking, dancing, and interacting with objects look significantly more natural than other models. Their motion model was trained on Kuaishou's massive short-video dataset (think: Chinese TikTok).
- Access: API via fal.ai, Replicate. Direct via their platform.
- Lip sync: Kling offers avatar lip-sync (image + audio -> talking head video). This is what the $34 avatar pipeline uses.
Veo (Google)
Google's video generation model, integrated with the Gemini ecosystem.
- Duration: Up to 8 seconds (Veo 2), extending in newer versions
- Resolution: 1080p, pushing to 4K
- Strength: Cinematic quality. Google trained on high-quality video data. Outputs have a filmic quality that others lack.
- Access: Vertex AI API, integrated into Google's AI Studio
- Veo 3: Upcoming version with native audio generation (generates both video AND matching sound effects). This is a significant leap.
Runway Gen-3
The creative tooling company. Runway built the earliest usable video generation tools and continues to focus on creative workflows.
- Duration: 5-10 seconds
- Strength: Creative control. Runway offers motion brushes (paint where motion should happen), camera controls (pan, zoom, dolly), and reference image inputs. It's the most controllable video generation tool.
- Access: Web app + API
Wan (Alibaba)
Open-source video generation model.
- Strength: Open weights. You can self-host and fine-tune. Community ecosystem building.
- Access: Open weights on Hugging Face
The Stitching Problem: Making Long Videos
All current models generate 5-10 seconds maximum in a single pass. For longer videos, you stitch clips.
Clip 1 (5 sec): generated from text
Clip 2 (5 sec): last frame of clip 1 -> image-to-video
Clip 3 (5 sec): last frame of clip 2 -> image-to-video
...
Each clip uses the previous clip's last frame as its starting image. This maintains visual continuity.
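The stitching loop might look like this. `text_to_video` and `image_to_video` are hypothetical placeholders for whichever provider API you call (the names and signatures are invented for illustration; here they return dummy frame lists):

```python
# Stub generators: each returns a list of frames (5 sec at 24 fps = 120).
def text_to_video(prompt, seconds=5):
    return ["frame"] * (seconds * 24)

def image_to_video(first_frame, prompt, seconds=5):
    return [first_frame] + ["frame"] * (seconds * 24 - 1)

def stitch(prompt, n_clips=3):
    clips = [text_to_video(prompt)]            # clip 1: from text
    for _ in range(n_clips - 1):
        last_frame = clips[-1][-1]             # anchor the next clip
        clips.append(image_to_video(last_frame, prompt))
    return [f for clip in clips for f in clip] # concatenate all clips

video = stitch("a ball rolling across a table")
print(len(video))  # 360 frames = 15 seconds at 24 fps
```

Even with the anchoring, seams between clips are visible without further work, which is what the mitigations below address.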
Mitigations:
- Overlap: generate clips with 0.5-1 second overlap and blend the overlapping frames
- Style anchoring: provide the same reference image as IP-Adapter input for every clip
- Quality filtering: generate multiple clip candidates and select the most consistent one
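The overlap mitigation can be as simple as a linear crossfade over the shared frames. A minimal NumPy sketch, assuming clips are arrays of frames; real pipelines may use optical-flow-guided blending instead:

```python
import numpy as np

def crossfade(clip_a, clip_b, overlap):
    # Blend clip_a's last `overlap` frames into clip_b's first `overlap`
    # frames with a linear alpha ramp, then concatenate the remainder.
    head, tail = clip_a[:-overlap], clip_b[overlap:]
    alphas = np.linspace(0.0, 1.0, overlap)[:, None, None, None]
    blended = (1 - alphas) * clip_a[-overlap:] + alphas * clip_b[:overlap]
    return np.concatenate([head, blended, tail])

a = np.ones((120, 8, 8, 3))          # 5 sec at 24 fps, dummy frames
b = np.zeros((120, 8, 8, 3))
out = crossfade(a, b, overlap=12)    # 0.5 sec overlap
print(out.shape)  # (228, 8, 8, 3)
```

Note the total length shrinks by the overlap (240 - 12 = 228 frames); the overlapping second is generated twice and counted once.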
Professional AI video production (music videos, short films, ads) typically generates 20-30 candidate clips and manually selects the best 10-15 to assemble the final video. The generation is cheap. The curation is the craft.
The Compute Reality
Video generation is expensive:
| Model | Time for 5-sec clip | Cost per clip |
|---|---|---|
| Sora (via ChatGPT) | 30-120 sec | Included in subscription |
| Kling (via fal.ai) | 60-180 sec | $0.10-0.50 |
| Veo (via API) | 30-90 sec | $0.10-0.30 |
| Runway (via web) | 30-60 sec | $0.25-0.50 |
| Wan (self-hosted) | 120-300 sec | GPU cost only |
A 1-minute video assembled from twelve 5-second clips costs $1-6 in generation. With candidate selection (generating 3x candidates), multiply by 3.
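The arithmetic, using the Kling row from the table as the per-clip price range:

```python
clip_seconds = 5
video_seconds = 60
clips = video_seconds // clip_seconds          # 12 clips for one minute
cost_low, cost_high = 0.10, 0.50               # $/clip (Kling via fal.ai)
candidates = 3                                 # generate 3x, keep the best

print(round(clips * cost_low, 2), round(clips * cost_high, 2))  # 1.2 6.0
print(round(clips * cost_low * candidates, 2),
      round(clips * cost_high * candidates, 2))                 # 3.6 18.0
```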
Self-hosting is possible but requires serious hardware. A single A100 (80GB VRAM) handles most models. Faster generation needs multiple GPUs or optimized inference servers.
Where Video Gen Is Going
Native audio: Veo 3 generates synchronized sound effects with the video. This eliminates the separate foley step.
Longer duration: Research is pushing toward 30-60 second single-pass generation. Current 5-10 second limits are engineering constraints, not architectural ones.
Camera control: Specify camera movements (dolly in, pan left, crane shot) as part of the prompt. Runway leads here.
Physics simulation: Models that actually understand physics (not just approximate it). This would produce videos where objects interact realistically — water flows correctly, cloth drapes correctly, gravity works correctly.
Real-time generation: LCM-style distillation for video. Generate frames fast enough to render in real-time. This would enable interactive AI-generated environments — effectively, AI-generated video games.
This is post 12 of the AI Engineering Explained series.
Next post: The Players — Sora, Kling, Veo, Runway in depth. Architectural differences, when to use which, and how to build production video pipelines.