High-Resolution Image Synthesis with Latent Diffusion Models

A plain summary, so you can get the gist here without leaving.

This 2022 paper is the one that put text-to-image generation in everyone's hands. It is the research behind Stable Diffusion, and its big idea was to make image generation cheap enough to run on a single ordinary graphics card.

What it is

A diffusion model learns to make pictures by starting from pure noise, like static on an old television, and slowly cleaning it up step by step until a real image appears. The trick is that during training the model is shown real images with noise added, and it practices guessing what the noise was. Do that millions of times and the model becomes very good at turning randomness into something that looks like a photo or a painting.
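To make the "add noise, then guess the noise" practice concrete, here is a minimal sketch in plain Python. It assumes a linear noise schedule with illustrative values (1,000 steps, betas from 0.0001 to 0.02); the summary above does not specify a schedule, and all function names are made up for this example. A real model would be a neural network trained to lower the loss; here we only show what the training signal looks like.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after each step (assumed linear schedule)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Mix Gaussian noise into x0; return the noisy sample and the noise used."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
          for x, n in zip(x0, noise)]
    return xt, noise

def noise_prediction_loss(predicted_noise, actual_noise):
    """Mean squared error between the guessed noise and the true noise."""
    return sum((p - a) ** 2 for p, a in zip(predicted_noise, actual_noise)) / len(actual_noise)

def recover_x0(xt, noise, alpha_bar):
    """If the noise is known exactly, the clean sample can be solved for."""
    return [(v - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
            for v, n in zip(xt, noise)]

rng = random.Random(0)
alpha_bars = make_alpha_bars()
clean = [0.5, -0.25, 1.0]                      # stand-in for a tiny "image"
noisy, true_noise = add_noise(clean, alpha_bars[500], rng)
lazy_guess = [0.0] * len(clean)                # an untrained model might guess "no noise"
print(noise_prediction_loss(lazy_guess, true_noise))   # positive: room to improve
print(noise_prediction_loss(true_noise, true_noise))   # 0.0: a perfect guess
```

The last function shows why guessing the noise is enough: a perfect guess lets you subtract the noise and get the clean sample back.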

The problem before this paper was cost. Doing all that step-by-step cleanup directly on full-size images, where every pixel matters, takes enormous computing power. Only big labs could afford it.

The core idea

The authors' move was to stop working with the full pixel image and instead work in a compressed space they call the latent space. First, a separate, smaller network (an autoencoder) squeezes an image down into a much smaller code that keeps the meaningful structure and throws away fine detail that can be filled back in later. The diffusion happens inside that compressed code, which is far smaller, so each step is much faster and lighter.
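A quick bit of arithmetic shows how much smaller the compressed code can be. The numbers below are assumptions, not figures from this summary: a 512×512 RGB image, a downsampling factor of 8 per side, and 4 latent channels, which is the setup commonly associated with Stable Diffusion. The helper names are hypothetical.

```python
def count_values(height, width, channels):
    """Total number of values in an image or latent grid."""
    return height * width * channels

def latent_shape(height, width, factor=8, latent_channels=4):
    """Shape of the compressed code for an assumed downsampling factor."""
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return height // factor, width // factor, latent_channels

pixel_values = count_values(512, 512, 3)        # 786432
lat_h, lat_w, lat_c = latent_shape(512, 512)    # (64, 64, 4)
latent_values = count_values(lat_h, lat_w, lat_c)  # 16384
print(pixel_values / latent_values)             # 48.0
```

Under these assumptions, every denoising step handles roughly 48 times fewer values than it would on raw pixels.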

Once the model has produced a finished code, a decoder expands it back up into a full, sharp image. The authors also added cross-attention layers that let a text prompt steer the result, so you can ask for what you want in words. The outcome is comparable image quality at a fraction of the compute.
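The paper's mechanism for text steering is cross-attention: each position in the image code looks at the prompt's tokens and pulls in a weighted mix of them. Below is a minimal sketch in plain Python. It leaves out the learned projection matrices a real model applies to queries, keys, and values, and the tiny vectors are invented for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """For each query (a latent position), average the values (text tokens),
    weighted by how well the query matches each key."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Two latent positions (queries) and two prompt tokens (keys and values).
latent_queries = [[4.0, 0.0], [0.0, 4.0]]
token_keys = [[1.0, 0.0], [0.0, 1.0]]
token_values = [[1.0, 0.0], [0.0, 1.0]]

attended = cross_attention(latent_queries, token_keys, token_values)
print(attended)  # position 0 leans toward token 0, position 1 toward token 1
```

The design point is that the prompt never has to be the same size as the image code: any number of tokens can feed any number of latent positions.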

Why it matters

Because the computation now fit on a consumer machine, the model could be released openly. That is what turned image generation from a lab demo into a tool that students, artists, and small teams could actually use and build on. A whole ecosystem of tools and fine-tuned variants grew out of it.

For anyone learning to build with AI, this is a clean example of a recurring lesson. A smart change in where you do the work, rather than a bigger model, can unlock something for an entire community.

Key points

Diffusion models generate images by gradually removing noise, learned by practicing on noisy versions of real pictures.
The key innovation is doing this in a small compressed latent space instead of on full-resolution pixels, which slashes the compute needed.
An encoder compresses, the diffusion runs on the code, and a decoder expands it back to a full image.
Text prompts are wired in so you can describe what you want in plain words.
The cost is low enough to run on a single consumer GPU, which is why the model could be released openly as Stable Diffusion and spark a large community.
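The overall shape of generation described in the key points (many small steps on the code, then one decode at the end) can be sketched with toy stand-ins. Everything here is hypothetical: the "decoder" just repeats values, the "encoder" just averages them, and the "denoiser" nudges the code toward a fixed target instead of being a trained network that predicts noise from the code, the step number, and the prompt. The only thing this sketch is meant to show is where the work happens.

```python
import random

def toy_encoder(signal, factor=8):
    """Stand-in for the learned encoder: average each block of `factor` values."""
    return [sum(signal[i:i + factor]) / factor for i in range(0, len(signal), factor)]

def toy_decoder(latent, factor=8):
    """Stand-in for the learned decoder: repeat each latent value `factor` times."""
    return [v for v in latent for _ in range(factor)]

def toy_denoise_step(latent, target, strength=0.2):
    """Stand-in for one step of the trained model: move a little toward `target`."""
    return [z + strength * (t - z) for z, t in zip(latent, target)]

def generate(target_latent, steps=50, factor=8, seed=0):
    """Start from noise in the small latent space, refine it, then decode once."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in target_latent]   # pure noise
    for _ in range(steps):                                   # all steps touch only the small code
        latent = toy_denoise_step(latent, target_latent)
    return toy_decoder(latent, factor)                       # single expansion to full size

target = [0.1, 0.9, -0.4, 0.3]      # 4 latent values standing in for a finished code
image = generate(target)
print(len(target), len(image))      # 4 32
```

Note that the loop never touches the full-size output; the expensive, repeated part runs on the small code, and the expansion to full size happens only once.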

Open the original source

Rombach, Blattmann et al.
