
Fine-Tuning Image AI — LoRA, ControlNet, and IP-Adapter
The base model doesn't know your face, your brand, or your product. LoRA teaches it new concepts. ControlNet guides composition. IP-Adapter transfers style. Together, they give you precise control.
The base model generates beautiful images. But it doesn't know your face. It doesn't know your brand style. It can't generate your product consistently. It can't follow a specific composition you have in mind.
Fine-tuning and control mechanisms close this gap. Here's how each one works and when to use it.
LoRA: Teaching the Model New Concepts
LoRA (Low-Rank Adaptation) is a technique for fine-tuning a model on new data without retraining the entire model.
The Problem
A diffusion model has hundreds of millions (or billions) of parameters. Fine-tuning all of them on your 20 photos of a specific face would:
- Take days on expensive GPUs
- Overfit catastrophically (the model would only generate that face)
- Destroy the model's general capabilities
The Solution
Instead of changing all parameters, LoRA adds small, trainable "adapter" matrices alongside the existing weights. These adapters capture the new concept (your face, your brand style, a specific art style) while the base model remains frozen.
Original weight matrix: W (1024 x 1024 = 1,048,576 parameters)
LoRA adapter: A (1024 x 4) and B (4 x 1024) = 8,192 parameters total
During generation:
output = W * input + A * B * input
The rank (4 in this example) determines the adapter's capacity. Low rank = less expressive but faster to train. High rank = more expressive but more data needed.
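The arithmetic above can be checked directly. A minimal numpy sketch (dimensions and rank match the example; this is a single linear layer, not a full diffusion model):

```python
import numpy as np

d, r = 1024, 4
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))  # frozen base weight: 1,048,576 parameters
A = rng.standard_normal((d, r))  # trainable adapter half: 1024 x 4
B = rng.standard_normal((r, d))  # trainable adapter half: 4 x 1024

assert W.size == 1_048_576
assert A.size + B.size == 8_192  # under 1% of the original matrix

# During generation: output = W * input + A * B * input
x = rng.standard_normal(d)
out = W @ x + A @ (B @ x)

# Once trained, the adapter can be folded into the weight with no runtime overhead
merged = W + A @ B
assert np.allclose(out, merged @ x)
```

Folding `A @ B` into `W` is why a trained LoRA adds essentially zero inference cost.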
Training a Face LoRA
To train a LoRA that generates your face:
1. Collect 10-25 photos. Different angles, different lighting, different expressions. No filters. No other people in frame. Square crop, 512x512 or 1024x1024.
2. Caption each image. Either manually ("a photo of sks man, brown skin, short black hair, wearing a blue shirt") or automatically (BLIP-2 captioning). The trigger word ("sks" is convention) is how you'll invoke the LoRA later.
3. Train. Typical settings:
   - Base model: Flux Dev or SDXL
   - Steps: 500-2,000
   - Learning rate: 1e-4
   - Rank: 8-32
   - Time: 15-45 minutes on an A100 GPU
4. Generate. In your prompt, include the trigger word: "a photo of sks man in a business suit, studio lighting." The LoRA activates and generates your face.
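What "only the adapters train" means mechanically can be shown on a toy problem. This is not a diffusion trainer — just a frozen linear layer plus rank-4 adapters fit by gradient descent on synthetic data, with all sizes and the learning rate chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 32, 4, 256                     # toy sizes: feature dim, LoRA rank, samples

W = rng.standard_normal((d, d)) * 0.1    # "base model" weight: frozen
A = rng.standard_normal((d, r)) * 0.1    # trainable adapter
B = rng.standard_normal((r, d)) * 0.1    # trainable adapter
W_frozen = W.copy()

# A synthetic "new concept": a low-rank change the base weights lack
delta = (rng.standard_normal((d, r)) @ rng.standard_normal((r, d))) * 0.05
X = rng.standard_normal((n, d))
T = X @ (W + delta)                      # outputs we want to reproduce

def loss(A, B):
    R = X @ (W + A @ B) - T
    return np.mean(np.sum(R ** 2, axis=1))

init = loss(A, B)
lr = 0.05
for _ in range(500):
    R = X @ (W + A @ B) - T                          # residual
    G = 2 * X.T @ R / n                              # gradient w.r.t. the product A @ B
    A, B = A - lr * (G @ B.T), B - lr * (A.T @ G)    # only the adapters update

assert np.allclose(W, W_frozen)          # the base weights never change
assert loss(A, B) < 0.1 * init           # the adapters absorbed the new concept
```

The base model is untouched on disk; the "new concept" lives entirely in the small A and B matrices, which is also why a LoRA file is megabytes rather than gigabytes.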
LoRA Composition
You can apply multiple LoRAs simultaneously:
- Face LoRA (your identity) + Style LoRA (oil painting) + Pose LoRA (specific body position)
Each LoRA has a weight (0 to 1) that controls its influence. Face at 0.8, style at 0.5 = your face in a softened art style.
This composability is why LoRA became the standard for image customization. It's modular. Train each concept once. Combine at generation time.
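Composition at generation time amounts to a weighted sum of adapter deltas. A sketch of that arithmetic (illustrative numpy; real pipelines apply this per attention layer, not to one matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 8

W = rng.standard_normal((d, d))          # frozen base weight
loras = {                                # one (A, B) adapter pair per trained concept
    "face":  (rng.standard_normal((d, r)) * 0.1, rng.standard_normal((r, d)) * 0.1),
    "style": (rng.standard_normal((d, r)) * 0.1, rng.standard_normal((r, d)) * 0.1),
}
weights = {"face": 0.8, "style": 0.5}    # per-LoRA influence, set at generation time

W_eff = W.copy()
for name, (A, B) in loras.items():
    W_eff += weights[name] * (A @ B)     # weighted delta, merged before sampling

assert not np.allclose(W_eff, W)         # the adapters changed the effective weights

# Setting every weight to 0 removes all LoRA influence: back to the base model
W_off = W + sum(0.0 * (A @ B) for A, B in loras.values())
assert np.allclose(W_off, W)
```

Each concept stays in its own file; the blend is decided per image, which is the modularity the paragraph above describes.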
When LoRA Fails
LoRA learns what something looks like, not where it appears. It can't enforce a pose, a camera angle, or a layout — prompt a face LoRA with "standing on the left" and the model is free to ignore you. For spatial control, you need a different mechanism.
ControlNet: Guiding Composition
LoRA teaches the model new concepts. ControlNet tells the model where to put things.
The Problem
Diffusion models are good at generating images but bad at following spatial instructions. "A person standing on the left, looking right, with a city behind them" might generate the person in the center, looking forward, with a forest.
The Solution
ControlNet takes a spatial guide — an edge map, a depth map, a pose skeleton, a segmentation mask — and uses it to control the spatial layout of the generated image.
Input: Text prompt + Control image (e.g., pose skeleton)
Output: Generated image that follows the pose AND the prompt
Control Types
Canny Edge: Extract edges from a reference image. The generated image follows the same edge structure. Use for: maintaining composition from a sketch or reference photo.
Depth Map: Estimate the depth (near/far) of objects in a reference. The generated image places objects at the same depths. Use for: architectural visualization, interior design.
OpenPose: Extract human body keypoints (joints, skeleton) from a reference. The generated image places a person in the same pose. Use for: character consistency, specific body positions.
Segmentation: Label regions (sky, ground, person, building) in a reference. The generated image fills each region with appropriate content. Use for: scene composition.
Normal Map: Surface orientation information. Use for: 3D-looking renders, product visualization.
How It Works Technically
ControlNet adds a trainable copy of the encoder portion of the U-Net (or DiT). This copy processes the control image. Its outputs are added to the original model's intermediate features at each layer.
Original model: input -> encoder -> bottleneck -> decoder -> output
                                                     ^
ControlNet:     control -> encoder_copy -------------|
The control signal is injected at multiple resolution levels, influencing both global composition and local details. The original model weights are frozen. Only the ControlNet encoder copy is trained.
This architecture means ControlNet doesn't change the base model. It works as a plug-in. You can use it or not. Different ControlNets for different control types, same base model.
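The injection can be sketched in a few lines. This toy uses plain matrices as stand-ins for one resolution level (real ControlNets copy full U-Net encoder blocks), but it captures a key detail: the connections are zero-initialized, so an untrained ControlNet is a no-op:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16

enc_frozen  = rng.standard_normal((d, d))  # frozen base encoder, one "layer"
enc_control = enc_frozen.copy()            # trainable copy, initialized from the base
zero_proj   = np.zeros((d, d))             # zero-initialized projection ("zero conv")

latent  = rng.standard_normal(d)           # noisy latent being denoised
control = rng.standard_normal(d)           # e.g. an encoded pose skeleton

base_feat    = enc_frozen @ latent
control_feat = zero_proj @ (enc_control @ control)  # zero at init: no effect yet
combined     = base_feat + control_feat             # injected by addition, per level

# Before training, the zero projection leaves the base model's output unchanged
assert np.allclose(combined, base_feat)
```

The zero initialization is what lets training start from a working model instead of a broken one: control influence grows from nothing as `zero_proj` learns.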
IP-Adapter: Style Transfer by Example
LoRA trains a concept into the model. ControlNet guides spatial layout. IP-Adapter transfers the style of a reference image to the generation.
The Problem
You have an image whose style you love — the color palette, the lighting, the mood, the texture. You want to generate new images in that same style. But describing a style in text is imprecise. "Warm, moody, cinematic" could mean a thousand things.
The Solution
IP-Adapter (Image Prompt Adapter) takes a reference image and uses it as a style guide. Instead of (or alongside) text conditioning, it conditions the diffusion model on the visual features of the reference image.
Input: Text prompt + Reference image (for style)
Output: Generated image with the CONTENT of the prompt and the STYLE of the reference
How It Works
IP-Adapter uses a CLIP image encoder to extract visual features from the reference image. These features are injected into the diffusion model's cross-attention layers alongside the text embedding.
Text prompt -> CLIP text encoder -> text features
Reference image -> CLIP image encoder -> image features
Both features -> Cross-attention in denoiser -> Generation guided by text AND image
The model attends to both the text (what to generate) and the reference image (how it should look). A weight parameter controls the balance between text adherence and style adherence.
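The dual conditioning can be sketched as two cross-attention passes whose results are summed — text attention plus a scaled image attention. A toy numpy version (token counts and dimensions are illustrative; real models use separate learned projections per stream):

```python
import numpy as np

def attention(Q, K, V):
    # Standard scaled dot-product attention with a row-wise softmax
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(3)
d, n_q, n_txt, n_img = 32, 8, 77, 4

Q     = rng.standard_normal((n_q, d))    # queries from the noisy image latents
K_txt = rng.standard_normal((n_txt, d))  # projected text features
V_txt = rng.standard_normal((n_txt, d))
K_img = rng.standard_normal((n_img, d))  # projected reference-image features
V_img = rng.standard_normal((n_img, d))

scale = 0.6  # the balance knob: 0 = ignore the reference, 1 = full style influence
out = attention(Q, K_txt, V_txt) + scale * attention(Q, K_img, V_img)

# scale = 0 reduces to ordinary text-only cross-attention
text_only = attention(Q, K_txt, V_txt)
assert np.allclose(text_only + 0.0 * attention(Q, K_img, V_img), text_only)
```

That single `scale` parameter is the practical dial: raise it when the output drifts from the reference's look, lower it when the prompt stops being followed.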
Practical Applications
Brand consistency: Generate marketing images that all match a brand's visual identity. Use one "brand reference image" as the IP-Adapter input across all generations.
Character consistency across scenes: Generate a character in different scenes using a reference image of that character. The character's visual identity carries across, while the scene changes based on the text prompt. Not perfect, but significantly better than text-only prompting.
Art direction from mood boards: A designer selects a mood board image. IP-Adapter transfers that mood to all generated assets.
The Consistency Problem
The hardest unsolved problem in image generation: maintaining a consistent character, object, or style across multiple images.
If you generate "a woman with red hair in a coffee shop" five times, you get five different women. Same prompt, different face every time. This is where the tools above earn their keep: a LoRA bakes an identity into trained weights, and an IP-Adapter carries it through a reference image. Neither is perfect, but together they make cross-image consistency workable.
Putting It All Together
A real production workflow might combine all three:
1. Train a LoRA on 20 photos of the product (15 min, once)
2. Create a ControlNet depth map for the desired composition
3. Use an IP-Adapter with a brand reference image for style consistency
4. Generate with: LoRA (product identity) + ControlNet (layout) + IP-Adapter (style)
Result: a product image that shows YOUR product, in the composition YOU specified, in YOUR brand style. From text prompt to final image in 5 seconds.
This is the workflow that eliminates the $5,000 product shoot. Not for hero images (where human photography still wins), but for the 200 variant images needed for e-commerce listings, social media, email campaigns.
Summary of Control Mechanisms
| Mechanism | What It Controls | Training Required | At Generation Time |
|---|---|---|---|
| LoRA | Identity (face, object, concept) | Yes (15-45 min) | Apply as adapter |
| ControlNet | Spatial layout (pose, edges, depth) | Pre-trained (download) | Provide control image |
| IP-Adapter | Style (colors, mood, aesthetic) | Pre-trained (download) | Provide reference image |
| CFG Scale | Prompt adherence strength | No | Set parameter (7-12) |
| Seed | Reproducibility | No | Set specific seed number |
| Negative prompt | What to avoid | No | List unwanted elements |
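Two of the table's rows reduce to simple arithmetic. CFG blends the unconditional and conditional noise predictions, and the seed fixes the starting noise that every generation begins from (illustrative numpy; real pipelines apply these inside the sampler):

```python
import numpy as np

# Classifier-free guidance: push the prediction toward the prompt, away from
# the unconditional prediction, by the CFG scale
def cfg(eps_uncond, eps_cond, scale):
    return eps_uncond + scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(4)
eps_u, eps_c = rng.standard_normal(8), rng.standard_normal(8)
assert np.allclose(cfg(eps_u, eps_c, 1.0), eps_c)  # scale 1: pure conditional
assert np.allclose(cfg(eps_u, eps_c, 0.0), eps_u)  # scale 0: prompt ignored

# Seed: same seed -> identical starting noise -> reproducible generation
noise_a = np.random.default_rng(seed=42).standard_normal((4, 64, 64))
noise_b = np.random.default_rng(seed=42).standard_normal((4, 64, 64))
assert np.array_equal(noise_a, noise_b)
```

Scales above 1 (the 7-12 range in the table) extrapolate past the conditional prediction, which is why high CFG values trade diversity for prompt adherence.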
This completes the image generation block. This is post 11 of the AI Engineering Explained series.
Next post: How AI Makes Video — from static frames to temporal coherence, DiT architectures, and why video is 100x harder than images.