Unified Intelligence Systems — Amit Jain (Luma AI)

Amit Jain explains why Luma abandoned 3D capture for video, and why the next models must fuse language-style reasoning with the physics understanding of visual models into one backbone.

The big idea

Today's image and video models are 'beautiful pixel generators' with almost no understanding, no memory, and no multi-turn ability. Luma's bet is a single transformer backbone that encodes audio, image, video, text, and code into one space and reasons about them together, like a brain where the modality-specific parts are just encoders feeding one cortex. The point is not prettier pixels but intelligence expressed in whatever medium the user needs, with generation and understanding welded together the way they already are in an LLM for text.

Design around the data

Luma started by betting on 3D because it carries more information than images, and built a popular capture app (Luma 3D Capture) that productionized NeRF and Gaussian splats. The app's popularity was never enough: no single company can outscale decades of internet photos and video. The lesson is to design algorithms around where the data already exists, not invent a pristine algorithm and starve it. Robotics hits the same wall now: there is no internet of action data.

Why video, then why not enough

When Nvidia's Hopper GPUs arrived in 2023, learning the world from video became feasible, since video is two dimensions of space plus one of time and the brain learns 3D through that time proxy. Luma shipped Dream Machine in March 2024 and hit 6 million users in weeks because Sora was announced but unreleased. By early 2025, roughly a year on, the same kind of realization came around again: video alone lacks human logic, causality, and any sense of why an event matters.

Unified vs disparate towers

In 2025 the field bolted separate language, image, video, and audio towers together with a thin fusion bridge, the way Stable Diffusion attaches a small text component. Jain says that thin bridge is the ceiling. Nano Banana is a large diffusion tower and a large language tower joined by a narrow 700-800M-parameter encoder, so it could not draw his schematic diagrams. Luma spent about a year and many failed attempts building a genuinely single backbone that now scales to hundreds of billions of parameters.
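The contrast between the two designs can be sketched with a toy example. This is only an illustration under stated assumptions, not Luma's or anyone else's architecture: the names `Token`, `encode`, `unified_sequence`, `tower_bridge`, and the width `D_MODEL` are all hypothetical, and the "encoders" are deterministic stand-ins. In the unified layout every token from every modality sits in one sequence that a single backbone could relate directly; in the tower layout each tower hands the others only one pooled summary vector.

```python
from dataclasses import dataclass

D_MODEL = 4  # hypothetical width of the shared embedding space


@dataclass(frozen=True)
class Token:
    modality: str  # "text", "image", "audio", ...
    vector: tuple  # embedding of length D_MODEL


def encode(modality, items):
    """Toy stand-in for a modality-specific encoder.

    Maps each raw item to a deterministic D_MODEL-wide vector so that
    every modality lands in the same space.
    """
    tokens = []
    for item in items:
        seed = sum(ord(ch) for ch in f"{modality}:{item}")
        vector = tuple(((seed * (i + 1)) % 7) / 7.0 for i in range(D_MODEL))
        tokens.append(Token(modality, vector))
    return tokens


def unified_sequence(inputs):
    """Unified design: all modalities become one token sequence."""
    sequence = []
    for modality, items in inputs.items():
        sequence.extend(encode(modality, items))
    return sequence


def mean_pool(tokens):
    """Collapse a non-empty token list into one D_MODEL-wide summary vector."""
    return tuple(
        sum(t.vector[i] for t in tokens) / len(tokens) for i in range(D_MODEL)
    )


def tower_bridge(inputs):
    """Tower design: each tower exposes only one pooled vector to the others."""
    return {m: mean_pool(encode(m, items)) for m, items in inputs.items()}


def cross_modal_pairs(sequence):
    """Count unordered token pairs that come from different modalities."""
    n = len(sequence)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if sequence[i].modality != sequence[j].modality
    )


inputs = {
    "text": ["draw", "a", "circuit"],
    "image": ["patch0", "patch1"],
    "audio": ["frame0"],
}
sequence = unified_sequence(inputs)
bridge = tower_bridge(inputs)
print(len(sequence), cross_modal_pairs(sequence), len(bridge))  # 6 11 3
```

With six tokens in one sequence there are 11 direct cross-modal token pairs, while the tower layout reduces each modality to a single vector before anything crosses over. That narrowing is the intuition behind the "thin bridge" ceiling described above.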

Understanding plus generation

In text, an LLM understands and generates in one model with no delta between them. In visual AI those two halves are split: VLMs understand images but cannot generate them; diffusion models like Flux generate but do not understand. Jain's frame is that pixels carry information the same way words do: a poem and a math proof are both just words, so how you arrange pixels determines the intelligence they convey. Unified models express that intelligence in text, slides, or video as convenient.

The end-to-end factory

Luma's stack is a model at the bottom orchestrating tool calls and reading multimodal input, a tool harness in the middle for Linux and APIs, and a fat skills layer on top holding domain knowledge as context. The slides in the talk were made one-shot by Uni 1 using an internally-written 50-page slide-design skill. Deployment is a REPL (read-eval-print loop), and Luma bets on mega-models sharing deep connective tissue rather than many small federated models with a judge on top.
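The three layers and the loop can be sketched as follows. This is a minimal sketch under stated assumptions, not Luma's implementation: `SKILLS`, `TOOLS`, `toy_model`, and `run_agent` are hypothetical names, the skill text is invented for illustration, and the "model" is a scripted function rather than a neural network.

```python
from typing import Callable, Dict, List, Tuple

# Skills layer: domain knowledge supplied as plain-text context (invented text).
SKILLS: Dict[str, str] = {
    "slide-design": "One idea per slide. Keep titles under eight words.",
}

# Tool harness: named callables the model may invoke (hypothetical tools).
TOOLS: Dict[str, Callable[[str], str]] = {
    "title_case": lambda text: text.title(),
    "word_count": lambda text: str(len(text.split())),
}

Action = Tuple[str, str, str]  # (kind, tool name, argument); kind is "call" or "done"


def toy_model(skill_context: str, task: str, history: List[str]) -> Action:
    """Scripted stand-in for the model: picks the next action from the history.

    A real model would condition on skill_context; this stand-in ignores it
    and hard-codes the eight-word rule instead.
    """
    if not history:
        return ("call", "title_case", task)
    if len(history) == 1:
        return ("call", "word_count", history[0])
    title, count = history[0], int(history[1])
    verdict = "ok" if count < 8 else "too long"
    return ("done", "", f"{title} ({count} words, {verdict})")


def run_agent(task: str, skill: str, max_steps: int = 5) -> str:
    """Loop: read the context, let the model choose, run the tool, repeat."""
    history: List[str] = []
    for _ in range(max_steps):
        kind, name, argument = toy_model(SKILLS[skill], task, history)
        if kind == "done":
            return argument
        history.append(TOOLS[name](argument))  # record the tool's output
    raise RuntimeError("agent did not finish within max_steps")


result = run_agent("why unified models matter", "slide-design")
print(result)  # Why Unified Models Matter (4 words, ok)
```

The point of the layering is that domain knowledge lives in `SKILLS` as context and capabilities live in `TOOLS`, so either can change without retraining the model at the bottom.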

The creative economy bet

Luma targets roughly 120 million creatives, two-to-three times the number of coders, whose daily job is replicating real-world physics in computers. Customers include Netflix and Amazon Prime (Prime's Moses show, at $4.5M per episode, is produced largely with Luma agents), Publicis, and Coca-Cola moving $3 billion of annual content production. Luma raised about $1.5 billion total, a billion in the last year, and argues visual AI is a strict superset of language work that will surpass LLMs as robotics arrives.

Key takeaways

Scale of data wins over data quality: design your algorithms around where the data already is, because no company can outscale the internet.
Luma pivoted from 3D capture to video because video packs space plus time and no app can capture enough 3D to learn the world.
Dream Machine reached 6 million users in weeks in 2024 largely because Sora was announced but not released.
Fused architectures with a thin bridge (like Nano Banana's 700-800M-parameter encoder) hit a hard ceiling on visual reasoning tasks.
A unified model encodes text, image, video, audio, and code into one transformer backbone and reasons about them in one space, like a single cortex.
The multi-turn breakthrough that made LLMs useful (RLHF, memory, iteration) is exactly what image and video models still lack.
The product is part of the lab: user preference and interaction traces are the feedback loop that makes each next model better.

In their words

“When someone says slop, it means they have never seen or used a good AI system before.”

Amit Jain

“Image models and video models that are not unified models are really really stupid. They have no understanding of what the hell they're generating.”

Amit Jain

“Just like language models produce words, how you arrange the pixels determines what they're conveying and how intelligent they are.”

Amit Jain

Terms to know

Unified model: One transformer backbone that both understands and generates across text, image, video, and audio, instead of separate stitched-together models.
Differentiable: A function you can put in a training loop with a loss, so gradient descent can optimize it; if it isn't differentiable, deep learning can't learn it.
NeRF / Gaussian splats: Techniques that reconstruct a 3D scene from 2D photos; Luma was first to productionize them in its capture app.
VLM: Vision-language model. It can understand images but cannot generate them, unlike a diffusion model, which generates but doesn't understand.
Skills layer: Domain knowledge given to the model as context (like a 50-page slide-design doc), not baked into weights or tools.
PE mindset: Jain's term for private-equity thinking in Hollywood: rent-seek a proven franchise with endless sequels instead of trying many new ideas.
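The "Differentiable" entry above can be made concrete with a small sketch. It is an illustration only, assuming a one-parameter model and two hypothetical losses: `smooth_loss`, which has a useful gradient everywhere, and `step_loss`, whose gradient is zero almost everywhere. The helper names (`numerical_grad`, `descend`) and the learning rate are likewise assumptions.

```python
def smooth_loss(w):
    """Differentiable loss: squared distance from the target value 3.0."""
    return (w - 3.0) ** 2


def step_loss(w):
    """Non-smooth loss: 0 if w reaches the target, 1 otherwise."""
    return 0.0 if w >= 3.0 else 1.0


def numerical_grad(loss, w, eps=1e-4):
    """Central-difference estimate of d(loss)/dw."""
    return (loss(w + eps) - loss(w - eps)) / (2.0 * eps)


def descend(loss, w=0.0, lr=0.1, steps=100):
    """Plain gradient descent on a single parameter."""
    for _ in range(steps):
        w -= lr * numerical_grad(loss, w)
    return w


w_smooth = descend(smooth_loss)
w_step = descend(step_loss)
print(round(w_smooth, 3), w_step)
```

Gradient descent drives `w_smooth` to the target because the smooth loss tells it which way to move at every point. On the step loss the gradient at the starting point is zero, so `w_step` never leaves its initial value.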

Watch the full lecture

Amit Jain at Stanford CS 153: Frontier Systems

