Andreas Blattmann helped invent Stable Diffusion in a small German lab with far less compute than Google or OpenAI. He argues the path to smarter AI runs through pixels, sound, and physical interaction, not just text.
Blattmann's core claim is that intelligence should be built the way babies learn it: first by observing natural signals like video and audio, then by interacting with the physical world. Text is a compact human invention that carries a lot of meaning per symbol, but it is only one narrow slice of how we understand the world. So instead of stacking vision on top of a language model, Black Forest Labs trains one multimodal model on natural data, because a model that both watches a bottle hit a table and hears the sound learns the physics far better than one trained on a single modality.
Latent diffusion origins
While a PhD student in Heidelberg around 2019, Blattmann worked with future co-founders Robin and Patrick on far less compute than Google or OpenAI had. Images are much higher-dimensional than text, so training a generator directly on raw pixels is wasteful. Their fix was latent diffusion: first train a compression model (like a learned JPEG encoder) that maps each image to a much smaller representation whose reconstruction still looks equivalent to a human viewer, then run the generator in that compact latent space. This cut compute by orders of magnitude and became the algorithm behind Stable Diffusion, released in 2022.
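To get a feel for the savings, here is a minimal back-of-the-envelope sketch in Python. The shapes are assumptions chosen for illustration and are not stated in the talk: a 512x512 RGB image, and an encoder that shrinks each side by a factor of 8 into 4 latent channels. The helper `num_values` is a hypothetical name.

```python
def num_values(height: int, width: int, channels: int) -> int:
    """Count the numbers the generator would have to model for one sample."""
    return height * width * channels


# Assumed, illustrative shapes (not from the talk).
IMAGE_SIDE = 512      # pixels per side
RGB_CHANNELS = 3
DOWNSAMPLE = 8        # assumed spatial compression factor per side
LATENT_CHANNELS = 4   # assumed number of latent channels

pixel_values = num_values(IMAGE_SIDE, IMAGE_SIDE, RGB_CHANNELS)
latent_values = num_values(
    IMAGE_SIDE // DOWNSAMPLE, IMAGE_SIDE // DOWNSAMPLE, LATENT_CHANNELS
)
ratio = pixel_values / latent_values

print(pixel_values, latent_values, ratio)  # 786432 16384 48.0
```

Under these assumptions the generator models 48 times fewer values per image. The real compute saving depends on the architecture, so treat the ratio as a rough indicator only.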
Natural vs human-made signals
Blattmann splits data into natural representations (video, audio) and human-made ones (text). Natural signals come from sources we cannot control, like sunlight and sound, and they carry a lot of redundancy, which is exactly why images and video must be compressed before training. Text is the opposite: evolution stripped out redundancy so humans could communicate efficiently, so each symbol carries a lot of information. He argues real intelligence starts from natural signals, the way a three-year-old who cannot yet read still understands the world better than a language model.
From unimodal to multimodal
Stable Diffusion was unimodal, a text-to-image model built for content creation. Today the frontier is one multimodal model trained jointly on image, video, and audio. The payoff is cross-modal correlation: if a model always hears a sound when two rigid bodies collide, it learns what is physically happening far better than a model that only sees. This unlocks capabilities beyond art, including physical AI, robotics, computer use, and world simulation, all from a single unified model.
Bootstrapping the Flux flywheel
When Black Forest Labs started, Blattmann's team focused narrowly on one gap: image models could not even draw hands with five fingers. In three months they shipped Flux 1, a text-to-image model aimed at being 10x better. Watching real users revealed they wanted precise control, not just text prompts, because a prompt like 'a blue bird' matches infinitely many images. That feedback led to Flux 1 Kontext, an editing model with reliable character consistency, letting you drop a real person into a new scene. Revenue from Kontext roughly doubled within six weeks, and Meta partnered with the roughly 25-person team to power image editing across its 2 billion users.
Open weights and verification
Visual quality is hard to verify because 'good' depends on who is looking, unlike code where you can run unit tests. That subjectivity is the argument for open models: give away good general weights and let Meta, or a government with different cultural preferences, customize the last mile for their own users. Physical tasks flip this. Once you hook a model to a robot, the real world enforces what is possible, so post-training becomes actual interaction that closes the feedback loop with verifiable data.
Diffusion vs autoregressive
Blattmann frames diffusion and language models as two sides of one coin. Diffusion models iterate along an artificial noise-to-image time axis, which makes them data-inefficient in training but lets you distill a 50-step model down to two or four steps for fast inference. Autoregressive language models iterate along the data itself, token by token, which is data-efficient to train but hard to speed up at inference. His open research question: combine the training efficiency of autoregressive models with the fast, distillable inference of diffusion models.
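The inference-side contrast can be sketched as two sampling loops. This is a toy illustration, not either model family's real sampler: `toy_denoise` and `toy_next_token` are hypothetical stand-ins for neural networks, the 256-token length is an arbitrary assumption, and `CallCounter` only counts how many sequential model calls each loop needs.

```python
from typing import Callable, List


class CallCounter:
    """Wraps a function and counts how many times it is called."""

    def __init__(self, fn: Callable):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def sample_diffusion(denoise: Callable, x: List[float], num_steps: int) -> List[float]:
    # Every step refines the whole sample, so the number of steps is a free choice.
    for step in range(num_steps):
        x = denoise(x, step)
    return x


def sample_autoregressive(next_token: Callable, prompt: List[int], num_new_tokens: int) -> List[int]:
    # Each step emits one token that depends on all tokens before it.
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


# Hypothetical stand-in "models": shrink values toward zero / emit a running index.
toy_denoise = lambda x, step: [0.5 * v for v in x]
toy_next_token = lambda tokens: len(tokens)

teacher = CallCounter(toy_denoise)   # many-step model
student = CallCounter(toy_denoise)   # distilled few-step model
lm = CallCounter(toy_next_token)

sample_diffusion(teacher, [1.0, -2.0, 0.5], num_steps=50)
sample_diffusion(student, [1.0, -2.0, 0.5], num_steps=4)
generated = sample_autoregressive(lm, prompt=[0], num_new_tokens=256)

print(teacher.calls, student.calls, lm.calls)  # 50 4 256
```

The diffusion loop length is a knob that distillation can turn down from 50 to 4, while the autoregressive loop length is tied to the output length. The sketch says nothing about sample quality, which is what distillation has to preserve in practice.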
- Latent diffusion trains the generator in a compressed representation, not raw pixels, cutting compute by orders of magnitude and enabling Stable Diffusion.
- Blattmann argues intelligence should be built from natural signals (video, audio) first, then interaction, the way a young child learns before it can read.
- Training one multimodal model lets it learn cross-modal correlations, like the sound of a collision, which teaches physics better than any single modality alone.
- Focus was the founding move: Black Forest Labs attacked one narrow gap (hands with five fingers) and shipped Flux 1 in three months.
- The user feedback loop, not raw research, drove Flux 1 Kontext, an editing model with reliable character consistency whose revenue roughly doubled within six weeks.
- Open weights create commercial value where preferences are subjective and heterogeneous, because customers can customize the model for their own audience.
- Diffusion models distill from 50 steps down to two or four for fast inference; autoregressive models are data-efficient but must generate token by token.
In their words
“You should start with from first principles how we humans do it, and that's clearly learning on natural representations by first observing, and second interacting.”
“We're training a unified multimodal model for natural representations on natural data that then can give rise to so much more.”
“The mark of a good leader is to not panic, keep calm, look at the data, assess the landscape, and then come up with a plan step by step.”
Terms to know
- Diffusion model: A generator that starts from pure noise and iteratively denoises it into an image, video, or audio sample.
- Latent space: A compressed representation of an image that decodes back to something perceptually equivalent for humans but is far smaller and cheaper to model.
- Multimodal model: A single model trained jointly on several natural signals (image, video, audio) rather than one type of input.
- Character consistency: Editing an image so a specific person or object stays recognizably the same across new scenes and prompts.
- Adversarial diffusion distillation: A technique that compresses a many-step diffusion model into a two-to-four-step one for fast, cheap generation.
- Self-Flow: Black Forest Labs' published method for aligning a generative model's internal representations across multiple modalities so it understands, not just draws.
Andreas Blattmann at Stanford CS 153: Frontier Systems