How AI Generates Images — Start with Noise, End with Art


2026-02-01 · 9 min read · ai · image-generation · engineering

You type a prompt. Three seconds later, an image exists that has never existed before. No database was searched. The image was generated from pure mathematical noise. Here's how diffusion actually works.


You type "a golden retriever wearing a space suit, standing on Mars, photorealistic." Three seconds later, an image exists that has never existed before.

No database of dog photos was searched. No Photoshop template was filled in. The image was generated from scratch — from pure mathematical noise.

Here's how.

The Core Idea: Learn to Remove Noise

Take a photo. Add a tiny bit of random noise — like static on a TV. The photo is now slightly degraded but still recognizable. Add more noise. More. More. Keep going for 1,000 steps.

At step 1, the image is slightly noisy. At step 500, it's barely recognizable. At step 1,000, it's pure static. Random noise. No information left.

This is called the forward diffusion process. It's simple: take a real image, gradually destroy it with noise until nothing remains.

Now the interesting part: reverse the process.

Train a neural network to look at a noisy image and predict what the slightly-less-noisy version looks like. Show it a step-500 image and teach it to produce the step-499 version. Show it step-999 and teach it to produce step-998.

Do this for millions of images across all 1,000 noise levels. The network learns a general skill: given any noisy image, remove a small amount of noise.

Now chain 1,000 denoising steps together. Start with pure random noise (step 1,000). Apply the denoiser. Get step 999. Apply again. Step 998. All the way back to step 0.

What emerges is an image. Not one of the training images. A new image that the network "imagined" by gradually refining noise into structure.
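The forward process has a convenient closed form (from the original DDPM paper): you can jump straight to any noise level t without simulating every intermediate step. A minimal NumPy sketch, where the "image" is just random data standing in for a real photo:

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar):
    """Jump straight to noise level t: x_t = sqrt(a_t)*x0 + sqrt(1-a_t)*eps."""
    eps = np.random.randn(*x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Linear beta schedule over 1,000 steps, as in the original DDPM paper.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

x0 = np.random.rand(64, 64)          # stand-in for a real image
xt, eps = forward_diffuse(x0, T - 1, alpha_bar)

# By the last step almost no signal remains: alpha_bar[T-1] is ~4e-5,
# so x_T is essentially pure Gaussian noise.
print(alpha_bar[-1])
```

Training then amounts to showing the network `xt` and the timestep, and asking it to predict `eps` — the noise that was added.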

Adding Text: How "Golden Retriever on Mars" Works

The denoiser on its own produces random images. To control what it generates, you need to condition it on text.

CLIP (Contrastive Language-Image Pre-training, by OpenAI) is the bridge between text and images. CLIP was trained on 400 million image-text pairs from the internet. It learned to encode both images and text into the same mathematical space.

In this shared space:

  • The text "a golden retriever" and a photo of a golden retriever are near each other
  • "a golden retriever" and a photo of a toaster are far apart
  • "photorealistic" shifts the position toward photographic textures
  • "oil painting" shifts it toward brush strokes and canvas textures

When you type a prompt, CLIP encodes it into a vector. This vector guides the denoising process. At each step, the denoiser doesn't just remove noise — it removes noise in the direction of your prompt. The network is nudged toward generating an image that, when encoded by CLIP, would be close to the text encoding.
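"Near" and "far" in this shared space are measured with cosine similarity. The 4-dimensional embeddings below are made-up toy vectors, not real CLIP outputs (which are 512-768 dimensions); they only illustrate the geometry:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 = same direction, ~0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings, invented purely for illustration.
text_retriever  = np.array([0.9, 0.1, 0.0, 0.1])
image_retriever = np.array([0.8, 0.2, 0.1, 0.1])
image_toaster   = np.array([0.0, 0.1, 0.9, 0.2])

print(cosine_similarity(text_retriever, image_retriever))  # high: near each other
print(cosine_similarity(text_retriever, image_toaster))    # low: far apart
```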

The technical term is classifier-free guidance (CFG). At each denoising step, the model generates two predictions: one conditioned on your prompt, one unconditioned (no prompt). The final prediction is the conditioned prediction amplified away from the unconditioned one:

prediction = unconditioned + scale * (conditioned - unconditioned)

The scale parameter (CFG scale, typically 7-12) controls how strongly the text influences the image. Low scale = more creative but less faithful to prompt. High scale = more faithful but less diverse.
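The CFG formula is one line of code. The noise predictions below are toy vectors, but the arithmetic is exactly the formula above:

```python
import numpy as np

def cfg(pred_uncond, pred_cond, scale):
    """Classifier-free guidance: push the prediction away from the
    unconditioned output, in the direction of the text-conditioned one."""
    return pred_uncond + scale * (pred_cond - pred_uncond)

uncond = np.array([0.2, 0.2])   # toy noise predictions
cond   = np.array([0.5, 0.1])

print(cfg(uncond, cond, 1.0))   # scale 1 = exactly the conditioned prediction
print(cfg(uncond, cond, 7.5))   # a typical CFG scale amplifies the difference
```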

Latent Space: The Speed Trick

Running diffusion on full-resolution images is extremely expensive. A 1024x1024 image has over a million pixels — more than 3 million RGB values. Running 1,000 denoising steps over 3 million values is computationally brutal.

Stable Diffusion's insight: Don't run diffusion in pixel space. Run it in latent space.

An autoencoder compresses images. The encoder takes a 1024x1024 image and compresses it into a 128x128 "latent" representation (with a few feature channels per position) — 64 times fewer spatial positions. The decoder takes the 128x128 latent and reconstructs the full-resolution image.

Pixel image (1024x1024) -> Encoder -> Latent (128x128) -> Decoder -> Pixel image (1024x1024)

The key: train the diffusion model to operate on the latent representation, not on pixels.

Random noise (128x128)
  -> 50 denoising steps in latent space
  -> Clean latent (128x128)
  -> Decoder
  -> Full-resolution image (1024x1024)

Nearly the same quality, with 64x fewer spatial positions per denoising step. And you only need ~50 steps instead of 1,000 (later research showed that fewer steps with better scheduling work fine).

This is why Stable Diffusion generates images in seconds on a consumer GPU. The actual diffusion happens in a compressed space. The decoder upscales at the end.
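The savings are easy to check with back-of-the-envelope arithmetic, using the article's own numbers:

```python
# Back-of-the-envelope arithmetic for the latent-space speedup.
# Exact ratios vary by model; these use the numbers from the text.

pixel_values  = 1024 * 1024 * 3   # RGB image: ~3.1M values
latent_values = 128 * 128 * 4     # 4-channel latent: 65,536 values

spatial_ratio = (1024 * 1024) // (128 * 128)   # 64x fewer spatial positions
step_ratio    = 1000 // 50                      # 20x fewer denoising steps

print(pixel_values, latent_values)   # 3145728 65536
print(spatial_ratio * step_ratio)    # ~1280x less work, very roughly
```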

This approach — diffusion in latent space — is called a Latent Diffusion Model (LDM). Nearly every modern image generation model uses some variant of it.

The Denoiser Architecture: U-Net and Transformers

The neural network that does the denoising has evolved through two major architectures.

U-Net (Stable Diffusion 1.x, 2.x)

The U-Net is shaped like the letter U:

Input (noisy latent)
  -> Downsample (compress spatially, increase channels)
    -> Downsample
      -> Bottleneck (smallest spatial size, most channels)
    -> Upsample (expand spatially, decrease channels)
  -> Upsample
-> Output (denoised latent, same size as input)

At each level, the network processes the image at a different resolution. Low-resolution levels capture global structure (overall composition, large shapes). High-resolution levels capture fine details (textures, edges, small features).

Skip connections link each downsample level to its corresponding upsample level. This preserves fine-grained details that would otherwise be lost during compression.

Text conditioning enters through cross-attention layers inserted at each level. The denoiser attends to the CLIP text embedding at every resolution, ensuring the text guides both global composition and fine details.
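A bare-bones NumPy sketch of single-head cross-attention — image positions as queries, text tokens as keys and values. The dimensions and random weights here are placeholders, not a real model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_tokens, Wq, Wk, Wv):
    """Each spatial location (query) attends over text tokens (keys/values).
    This is how the prompt steers every level of the denoiser."""
    Q = image_tokens @ Wq            # (num_positions, d)
    K = text_tokens @ Wk             # (num_words, d)
    V = text_tokens @ Wv             # (num_words, d)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V               # text-informed features per position

rng = np.random.default_rng(0)
d = 16
img = rng.standard_normal((64, d))   # 8x8 latent flattened to 64 tokens
txt = rng.standard_normal((77, d))   # 77 text tokens (CLIP's sequence length)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

out = cross_attention(img, txt, Wq, Wk, Wv)
print(out.shape)   # (64, 16): same spatial layout, now text-conditioned
```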

DiT: Diffusion Transformers (Flux, DALL-E 3, newer models)

Newer models replace the U-Net with a transformer — the same architecture LLMs use.

The latent image is divided into patches (small squares, like tiles). Each patch becomes a token. Text becomes tokens. All tokens — image patches and text — feed into a standard transformer with self-attention.
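Patchifying a latent into tokens is essentially a reshape. This sketch assumes a 128x128x4 latent and 2x2 patches (patch size varies by model):

```python
import numpy as np

def patchify(latent, patch=2):
    """Split an (H, W, C) latent into non-overlapping patch tokens,
    the way DiT-style models tokenize the image."""
    H, W, C = latent.shape
    tokens = (latent
              .reshape(H // patch, patch, W // patch, patch, C)
              .transpose(0, 2, 1, 3, 4)        # group each patch together
              .reshape(-1, patch * patch * C)) # one flat vector per patch
    return tokens

latent = np.zeros((128, 128, 4))
tokens = patchify(latent, patch=2)
print(tokens.shape)   # (4096, 16): 64x64 patches, each 2*2*4 values
```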

Why this matters:

  • Transformers scale better with more compute. U-Nets hit diminishing returns.
  • Unified architecture: the same attention mechanism handles both spatial relationships (image) and semantic relationships (text).
  • Training is more stable at large scale.

Flux uses a DiT architecture, as does Stable Diffusion 3; DALL-E 3 and Imagen 3 are closed models, but are widely believed to use transformer-based denoisers as well. SDXL was the last major U-Net model. The field has moved on.

The Generation Process Step by Step

Let's walk through what happens when you type "a golden retriever in a space suit on Mars, photorealistic":

1. TEXT ENCODING
   Your prompt -> CLIP text encoder -> a sequence of token embeddings (77 tokens x 768 dimensions in Stable Diffusion 1.x)

2. NOISE INITIALIZATION
   Random noise sampled from a Gaussian distribution
   Shape: 128x128x4 (latent space dimensions)

3. DENOISING LOOP (50 steps)
   For step = 50 to 1:
     a. Scheduler determines current noise level
     b. Denoiser network (U-Net or DiT) predicts the noise in the current image
        - Input: noisy latent + timestep + text embedding
        - Output: predicted noise
     c. Subtract predicted noise from current image
        (guided by text embedding via CFG)
     d. Result: slightly cleaner latent

4. DECODE
   Clean latent (128x128x4) -> VAE decoder -> pixel image (1024x1024x3)

5. OUTPUT
   A golden retriever in a space suit on Mars appears on your screen.

Total time: 2-10 seconds on a modern GPU. The 50 denoising steps are the computational bottleneck.
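The steps above can be sketched end to end. Everything here is a stand-in — the `denoiser` is a dummy function and the update rule is a crude placeholder for a real scheduler — but the control flow matches steps 2 through 4:

```python
import numpy as np

def generate(denoiser, text_embedding, steps=50, shape=(128, 128, 4),
             cfg_scale=7.5, seed=0):
    """Skeleton of the sampling loop. `denoiser` stands in for the
    trained U-Net/DiT and must be supplied by the caller."""
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(shape)            # step 2: pure noise
    for t in reversed(range(steps)):               # step 3: denoising loop
        eps_uncond = denoiser(latent, t, None)
        eps_cond   = denoiser(latent, t, text_embedding)
        eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)  # CFG
        latent = latent - eps / steps              # crude "scheduler" update
    return latent                                  # step 4 would decode this

# Dummy denoiser that just predicts a fraction of the current latent,
# so the loop runs end to end without a trained model.
dummy = lambda x, t, cond: 0.1 * x
out = generate(dummy, text_embedding=np.zeros(768))
print(out.shape)   # (128, 128, 4)
```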

Schedulers: Not All 50 Steps Are Equal

The scheduler (also called sampler or noise schedule) determines how much noise to remove at each step.

Early steps (high noise) make big structural decisions: Is this a dog or a cat? Is the background red or blue? Where is the subject positioned?

Late steps (low noise) refine details: fur texture, lighting reflections, small features.

Different schedulers distribute the denoising budget differently:

| Scheduler | Characteristic | Best For |
| --- | --- | --- |
| DDPM | Original, slow (1,000 steps) | Research |
| DDIM | Deterministic, fewer steps (~50) | Consistent outputs |
| Euler | Fast, good quality at 20-30 steps | General use |
| DPM++ 2M Karras | Fast convergence, sharp details | Production default |
| LCM | Ultra-fast (4-8 steps) | Real-time applications |
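The "Karras" in DPM++ 2M Karras refers to the noise schedule from Karras et al. (2022), which concentrates the step budget at low noise levels, where detail is refined. A sketch of that schedule (the sigma_min/sigma_max values here are illustrative, not any model's defaults):

```python
import numpy as np

def karras_sigmas(n, sigma_min=0.1, sigma_max=10.0, rho=7.0):
    """Karras et al. (2022) schedule: interpolate between sigma_max and
    sigma_min in rho-warped space, spending more steps at low noise."""
    ramp = np.linspace(0, 1, n)
    inv_rho = 1.0 / rho
    sigmas = (sigma_max**inv_rho
              + ramp * (sigma_min**inv_rho - sigma_max**inv_rho))**rho
    return sigmas

s = karras_sigmas(10)
print(np.round(s, 3))   # decreasing from 10.0 down to 0.1, densest near the end
```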

Why AI Still Fails at Certain Things

Hands and Fingers

Diffusion models struggle with hands because:

  1. Hands are geometrically complex — 5 fingers, multiple joints, varying poses
  2. Hands are small relative to the image — they occupy few latent space pixels
  3. Training data contains hands in wildly different poses, making the "average hand" a mess
  4. The model doesn't understand anatomy — it generates whatever pattern statistically fits

Newer models (DALL-E 3, Imagen 3, Flux Pro) have improved significantly, thanks to larger training datasets with better hand representation and architectural improvements. But hands remain the hardest body part to generate correctly.

Text in Images

Diffusion models generate images holistically — they don't have a concept of "this region is text that must be readable." Text requires pixel-precise letter shapes. The latent space compression loses this precision.

DALL-E 3 partly solved this by using a different text encoder (T5 instead of CLIP) and training specifically on images with text. Flux and Ideogram also handle text significantly better than earlier models. But long text passages remain unreliable.

Spatial Consistency

"A blue cube on top of a red sphere to the left of a green pyramid."

Understanding spatial relationships (on top of, to the left of, behind) requires compositional understanding that CLIP's text embedding struggles to capture. The text embedding represents the prompt as a single vector — it doesn't decompose spatial relationships explicitly.

What Happens at Scale: Cost and Infrastructure

Running diffusion models requires GPUs. Generating one image:

| Model | GPU Required | Time per Image | Cost per Image |
| --- | --- | --- | --- |
| Stable Diffusion (local) | RTX 3090 or better | 3-10 seconds | ~$0.001 (electricity) |
| DALL-E 3 (API) | OpenAI's infra | 5-15 seconds | $0.04-0.12 |
| Midjourney | Their infra | 10-60 seconds | ~$0.01-0.03 (subscription) |
| Flux Pro (API) | fal.ai / Replicate | 3-8 seconds | $0.03-0.055 |
| Imagen 3 (API) | Google Cloud | 5-10 seconds | $0.03-0.04 |

For production applications (e-commerce product images, marketing at scale, game assets), the per-image cost matters. Self-hosting Stable Diffusion or Flux on your own GPU is 10-100x cheaper than API calls at volume.
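A quick illustration of that gap, with assumed numbers (the API rate matches the table above; the GPU rental rate is a made-up round figure, not a quote):

```python
# Rough volume-cost comparison. All prices are illustrative assumptions.

images = 100_000
api_cost = images * 0.04                      # e.g. DALL-E 3 at $0.04/image

gpu_hourly = 1.0                              # assumed cloud GPU $/hour
seconds_per_image = 5
self_host_cost = images * seconds_per_image / 3600 * gpu_hourly

print(api_cost)                               # 4000.0
print(round(self_host_cost, 2))               # 138.89
print(round(api_cost / self_host_cost, 1))    # 28.8x cheaper self-hosted
```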

The One-Line Summary

Image generation is noise removal guided by text. Start with static. Remove noise step by step, guided by a text embedding that tells the denoiser what to aim for. Do it in compressed (latent) space for speed. Decode to pixels at the end.

The magic is not in any one step. It's in the fact that a network trained to remove noise, conditioned on text, produces coherent images that have never existed before.

This is post 9 of the AI Engineering Explained series.

Next post: The Models — DALL-E, Midjourney, Flux, Imagen. Same underlying mechanism, very different outputs. What's actually different and why.