
Fine-Tuning Image AI — LoRA, ControlNet, and IP-Adapter
The base model doesn't know your face, your brand, or your product. LoRA teaches it new concepts. ControlNet guides composition. IP-Adapter transfers style. Together, they give you precise control.
The base model generates beautiful images. But it doesn't know your face. It doesn't know your brand style. It can't generate your product consistently. It can't follow a specific composition you have in mind.
Fine-tuning and control mechanisms close this gap. Here's how each one works and when to use it.
LoRA: Teaching the Model New Concepts
LoRA (Low-Rank Adaptation) is a technique for fine-tuning a model on new data without retraining the entire model.
The Problem
A diffusion model has hundreds of millions (or billions) of parameters. Fine-tuning all of them on your 20 photos of a specific face would:
- Take days on expensive GPUs
- Overfit catastrophically (the model would only generate that face)
- Destroy the model's general capabilities
The Solution
Instead of changing all parameters, LoRA adds small, trainable "adapter" matrices alongside the existing weights. These adapters capture the new concept (your face, your brand style, a specific art style) while the base model remains frozen.
Original weight matrix: W (1024 x 1024 = 1,048,576 parameters)
LoRA adapter: A (1024 x 4) and B (4 x 1024) = 8,192 parameters total
During generation:
output = W * input + A * B * input
The rank (4 in this example) determines the adapter's capacity. Low rank = less expressive but faster to train. High rank = more expressive but more data needed.
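The arithmetic above can be checked directly. A minimal numpy sketch (dimensions and rank match the example; this is a single linear layer, not a full diffusion model):

```python
import numpy as np

d, r = 1024, 4
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))  # frozen base weight: 1,048,576 parameters
A = rng.standard_normal((d, r))  # trainable adapter half: 1024 x 4
B = rng.standard_normal((r, d))  # trainable adapter half: 4 x 1024

assert W.size == 1_048_576
assert A.size + B.size == 8_192  # under 1% of the original matrix

# During generation: output = W * input + A * B * input
x = rng.standard_normal(d)
out = W @ x + A @ (B @ x)

# Once trained, the adapter can be folded into the weight with no runtime overhead
merged = W + A @ B
assert np.allclose(out, merged @ x)
```

Folding `A @ B` into `W` is why a trained LoRA adds essentially zero inference cost.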
Training a Face LoRA
To train a LoRA that generates your face:
1. Collect 10-25 photos. Different angles, different lighting, different expressions. No filters. No other people in frame. Square crop, 512x512 or 1024x1024.
2. Caption each image. Either manually ("a photo of sks man, brown skin, short black hair, wearing a blue shirt") or automatically (BLIP-2 captioning). The trigger word ("sks" is convention) is how you'll invoke the LoRA later.
3. Train. Typical settings:
   - Base model: Flux Dev or SDXL
   - Steps: 500-2,000
   - Learning rate: 1e-4
   - Rank: 8-32
   - Time: 15-45 minutes on an A100 GPU
4. Generate. In your prompt, include the trigger word: "a photo of sks man in a business suit, studio lighting." The LoRA activates and generates your face.
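What "only the adapters train" means mechanically can be shown on a toy problem. This is not a diffusion trainer — just a frozen linear layer plus rank-4 adapters fit by gradient descent on synthetic data, with all sizes and the learning rate chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 32, 4, 256                     # toy sizes: feature dim, LoRA rank, samples

W = rng.standard_normal((d, d)) * 0.1    # "base model" weight: frozen
A = rng.standard_normal((d, r)) * 0.1    # trainable adapter
B = rng.standard_normal((r, d)) * 0.1    # trainable adapter
W_frozen = W.copy()

# A synthetic "new concept": a low-rank change the base weights lack
delta = (rng.standard_normal((d, r)) @ rng.standard_normal((r, d))) * 0.05
X = rng.standard_normal((n, d))
T = X @ (W + delta)                      # outputs we want to reproduce

def loss(A, B):
    R = X @ (W + A @ B) - T
    return np.mean(np.sum(R ** 2, axis=1))

init = loss(A, B)
lr = 0.05
for _ in range(500):
    R = X @ (W + A @ B) - T                          # residual
    G = 2 * X.T @ R / n                              # gradient w.r.t. the product A @ B
    A, B = A - lr * (G @ B.T), B - lr * (A.T @ G)    # only the adapters update

assert np.allclose(W, W_frozen)          # the base weights never change
assert loss(A, B) < 0.1 * init           # the adapters absorbed the new concept
```

The base model is untouched on disk; the "new concept" lives entirely in the small A and B matrices, which is also why a LoRA file is megabytes rather than gigabytes.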
LoRA Composition
You can apply multiple LoRAs simultaneously:
- Face LoRA (your identity) + Style LoRA (oil painting) + Pose LoRA (specific body position)
Each LoRA has a weight (0 to 1) that controls its influence. Face at 0.8, style at 0.5 = your face in a softened art style.
This composability is why LoRA became the standard for image customization. It's modular. Train each concept once. Combine at generation time.
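Composition at generation time amounts to a weighted sum of adapter deltas. A sketch of that arithmetic (illustrative numpy; real pipelines apply this per attention layer, not to one matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 8

W = rng.standard_normal((d, d))          # frozen base weight
loras = {                                # one (A, B) adapter pair per trained concept
    "face":  (rng.standard_normal((d, r)) * 0.1, rng.standard_normal((r, d)) * 0.1),
    "style": (rng.standard_normal((d, r)) * 0.1, rng.standard_normal((r, d)) * 0.1),
}
weights = {"face": 0.8, "style": 0.5}    # per-LoRA influence, set at generation time

W_eff = W.copy()
for name, (A, B) in loras.items():
    W_eff += weights[name] * (A @ B)     # weighted delta, merged before sampling

assert not np.allclose(W_eff, W)         # the adapters changed the effective weights

# Setting every weight to 0 removes all LoRA influence: back to the base model
W_off = W + sum(0.0 * (A @ B) for A, B in loras.values())
assert np.allclose(W_off, W)
```

Each concept stays in its own file; the blend is decided per image, which is the modularity the paragraph above describes.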
When LoRA Fails
LoRA learns what something looks like, not where it appears. It can't enforce a pose, a camera angle, or a layout — prompt a face LoRA with "standing on the left" and the model is free to ignore you. For spatial control, you need a different mechanism.
ControlNet: Guiding Composition
LoRA teaches the model new concepts. ControlNet tells the model where to put things.
The Problem
Diffusion models are good at generating images but bad at following spatial instructions. "A person standing on the left, looking right, with a city behind them" might generate the person in the center, looking forward, with a forest.
The Solution
ControlNet takes a spatial guide — an edge map, a depth map, a pose skeleton, a segmentation mask — and uses it to control the spatial layout of the generated image.
Input: Text prompt + Control image (e.g., pose skeleton)
Output: Generated image that follows the pose AND the prompt
Control Types
Canny Edge: Extract edges from a reference image. The generated image follows the same edge structure. Use for: maintaining composition from a sketch or reference photo.
Depth Map: Estimate the depth (near/far) of objects in a reference. The generated image places objects at the same depths. Use for: architectural visualization, interior design.
OpenPose: Extract human body keypoints (joints, skeleton) from a reference. The generated image places a person in the same pose. Use for: character consistency, specific body positions.
Segmentation: Label regions (sky, ground, person, building) in a reference. The generated image fills each region with appropriate content. Use for: scene composition.
Normal Map: Surface orientation information. Use for: 3D-looking renders, product visualization.
How It Works Technically
ControlNet adds a trainable copy of the encoder portion of the U-Net (or DiT). This copy processes the control image. Its outputs are added to the original model's intermediate features at each layer.
Original model: input -> encoder -> bottleneck -> decoder -> output
                                                     ^
ControlNet:     control -> encoder_copy -------------|
The control signal is injected at multiple resolution levels, influencing both global composition and local details. The original model weights are frozen. Only the ControlNet encoder copy is trained.
This architecture means ControlNet doesn't change the base model. It works as a plug-in. You can use it or not. Different ControlNets for different control types, same base model.
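The injection can be sketched in a few lines. This toy uses plain matrices as stand-ins for one resolution level (real ControlNets copy full U-Net encoder blocks), but it captures a key detail: the connections are zero-initialized, so an untrained ControlNet is a no-op:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16

enc_frozen  = rng.standard_normal((d, d))  # frozen base encoder, one "layer"
enc_control = enc_frozen.copy()            # trainable copy, initialized from the base
zero_proj   = np.zeros((d, d))             # zero-initialized projection ("zero conv")

latent  = rng.standard_normal(d)           # noisy latent being denoised
control = rng.standard_normal(d)           # e.g. an encoded pose skeleton

base_feat    = enc_frozen @ latent
control_feat = zero_proj @ (enc_control @ control)  # zero at init: no effect yet
combined     = base_feat + control_feat             # injected by addition, per level

# Before training, the zero projection leaves the base model's output unchanged
assert np.allclose(combined, base_feat)
```

The zero initialization is what lets training start from a working model instead of a broken one: control influence grows from nothing as `zero_proj` learns.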
IP-Adapter: Style Transfer by Example
LoRA trains a concept into the model. ControlNet guides spatial layout. IP-Adapter transfers the style of a reference image to the generation.
The Problem
You have an image whose style you love — the color palette, the lighting, the mood, the texture. You want to generate new images in that same style. But describing a style in text is imprecise. "Warm, moody, cinematic" could mean a thousand things.
The Solution
IP-Adapter (Image Prompt Adapter) takes a reference image and uses it as a style guide. Instead of (or alongside) text conditioning, it conditions the diffusion model on the visual features of the reference image.
Input: Text prompt + Reference image (for style)
Output: Generated image with the CONTENT of the prompt and the STYLE of the reference
How It Works
IP-Adapter uses a CLIP image encoder to extract visual features from the reference image. These features are injected into the diffusion model's cross-attention layers alongside the text embedding.
Text prompt -> CLIP text encoder -> text features
Reference image -> CLIP image encoder -> image features
Both features -> Cross-attention in denoiser -> Generation guided by text AND image
The model attends to both the text (what to generate) and the reference image (how it should look). A weight parameter controls the balance between text adherence and style adherence.
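The dual conditioning can be sketched as two cross-attention passes whose results are summed — text attention plus a scaled image attention. A toy numpy version (token counts and dimensions are illustrative; real models use separate learned projections per stream):

```python
import numpy as np

def attention(Q, K, V):
    # Standard scaled dot-product attention with a row-wise softmax
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(3)
d, n_q, n_txt, n_img = 32, 8, 77, 4

Q     = rng.standard_normal((n_q, d))    # queries from the noisy image latents
K_txt = rng.standard_normal((n_txt, d))  # projected text features
V_txt = rng.standard_normal((n_txt, d))
K_img = rng.standard_normal((n_img, d))  # projected reference-image features
V_img = rng.standard_normal((n_img, d))

scale = 0.6  # the balance knob: 0 = ignore the reference, 1 = full style influence
out = attention(Q, K_txt, V_txt) + scale * attention(Q, K_img, V_img)

# scale = 0 reduces to ordinary text-only cross-attention
text_only = attention(Q, K_txt, V_txt)
assert np.allclose(text_only + 0.0 * attention(Q, K_img, V_img), text_only)
```

That single `scale` parameter is the practical dial: raise it when the output drifts from the reference's look, lower it when the prompt stops being followed.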
Practical Applications
Brand consistency: Generate marketing images that all match a brand's visual identity. Use one "brand reference image" as the IP-Adapter input across all generations.
Character consistency across scenes: Generate a character in different scenes using a reference image of that character. The character's visual identity carries across, while the scene changes based on the text prompt. Not perfect, but significantly better than text-only prompting.
Art direction from mood boards: A designer selects a mood board image. IP-Adapter transfers that mood to all generated assets.
The Consistency Problem
The hardest unsolved problem in image generation: maintaining a consistent character, object, or style across multiple images.
If you generate "a woman with red hair in a coffee shop" five times, you get five different women. Same prompt, different face every time. This is where the tools above earn their keep: a LoRA bakes an identity into trained weights, and an IP-Adapter carries it through a reference image. Neither is perfect, but together they make cross-image consistency workable.
Putting It All Together
A real production workflow might combine all three:
1. Train a LoRA on 20 photos of the product (15 min, once)
2. Create a ControlNet depth map for the desired composition
3. Use an IP-Adapter with a brand reference image for style consistency
4. Generate with: LoRA (product identity) + ControlNet (layout) + IP-Adapter (style)
Result: a product image that shows YOUR product, in the composition YOU specified, in YOUR brand style. From text prompt to final image in 5 seconds.
This is the workflow that eliminates the $5,000 product shoot. Not for hero images (where human photography still wins), but for the 200 variant images needed for e-commerce listings, social media, email campaigns.
Summary of Control Mechanisms
| Mechanism | What It Controls | Training Required | At Generation Time |
|---|---|---|---|
| LoRA | Identity (face, object, concept) | Yes (15-45 min) | Apply as adapter |
| ControlNet | Spatial layout (pose, edges, depth) | Pre-trained (download) | Provide control image |
| IP-Adapter | Style (colors, mood, aesthetic) | Pre-trained (download) | Provide reference image |
| CFG Scale | Prompt adherence strength | No | Set parameter (7-12) |
| Seed | Reproducibility | No | Set specific seed number |
| Negative prompt | What to avoid | No | List unwanted elements |
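Two of the table's rows reduce to simple arithmetic. CFG blends the unconditional and conditional noise predictions, and the seed fixes the starting noise that every generation begins from (illustrative numpy; real pipelines apply these inside the sampler):

```python
import numpy as np

# Classifier-free guidance: push the prediction toward the prompt, away from
# the unconditional prediction, by the CFG scale
def cfg(eps_uncond, eps_cond, scale):
    return eps_uncond + scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(4)
eps_u, eps_c = rng.standard_normal(8), rng.standard_normal(8)
assert np.allclose(cfg(eps_u, eps_c, 1.0), eps_c)  # scale 1: pure conditional
assert np.allclose(cfg(eps_u, eps_c, 0.0), eps_u)  # scale 0: prompt ignored

# Seed: same seed -> identical starting noise -> reproducible generation
noise_a = np.random.default_rng(seed=42).standard_normal((4, 64, 64))
noise_b = np.random.default_rng(seed=42).standard_normal((4, 64, 64))
assert np.array_equal(noise_a, noise_b)
```

Scales above 1 (the 7-12 range in the table) extrapolate past the conditional prediction, which is why high CFG values trade diversity for prompt adherence.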
This completes the image generation block. This is post 11 of the AI Engineering Explained series.
Next post: How AI Makes Video — from static frames to temporal coherence, DiT architectures, and why video is 100x harder than images.