Jensen Huang argues computing is being reinvented for the first time in 64 years. The unit of design is no longer the chip but the whole rack, and the metric that matters is tokens per watt.
For 64 years the computer stayed roughly the same, and now it does not. Software is generated by neural networks in real time instead of being pre-recorded and retrieved on demand. NVIDIA's edge is co-design: optimizing algorithms, compilers, chips, networking, and storage together rather than in separate silos. That co-design bought about a million times more compute over ten years, which is what let researchers stop curating data and just feed the model everything.
Computing is being reinvented
The mental model of a computer has been stable since the IBM System/360 in 1964. Huang's own first architecture textbook was the System/360 manual. What changed is that old computing was pre-recorded content you retrieve on demand, and new computing is generated in real time, so it can be contextually relevant and respond to intent, not just explicit instructions. That shift rewrites every layer: how you write software, how you run it, the systems, the networking, the applications.
Co-design and the million-x
In the old world, chip designers, compiler writers, and language designers worked in separate fields. Co-design, in the tradition of John Hennessy's RISC work, optimizes them together: a simpler machine matched to a smart compiler beats two parts each optimized alone. Moore's Law gave about 10x every five years, or about 100x per decade, but stalled when Dennard scaling ran out roughly a decade ago. NVIDIA's co-design across CPUs, GPUs, networking, switches, and storage delivered about a million times more compute over ten years instead.
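To make the gap concrete, here is a minimal arithmetic sketch that compounds the two rates quoted above. The helper names are made up for illustration, and the only inputs are the figures from the talk (10x every five years, a million times over ten years).

```python
def gain_over(factor_per_period: float, period_years: float, horizon_years: float) -> float:
    """Total gain after `horizon_years` when gaining `factor_per_period` every `period_years`."""
    return factor_per_period ** (horizon_years / period_years)


def annual_factor(total_gain: float, years: float) -> float:
    """Per-year multiplier that compounds to `total_gain` over `years`."""
    return total_gain ** (1.0 / years)


# Figures quoted in the talk; everything else is derived from them.
moore_10y = gain_over(10.0, 5.0, 10.0)   # 10x every five years, compounded over a decade
codesign_10y = 1_000_000.0               # "1 million x over 10 years"

print(f"Moore's Law over 10 years: {moore_10y:.0f}x")
print(f"Moore's Law per year:      {annual_factor(moore_10y, 10):.2f}x")
print(f"Co-design per year:        {annual_factor(codesign_10y, 10):.2f}x")
print(f"Gap after 10 years:        {codesign_10y / moore_10y:.0f}x")
```

Read this way, the Moore's Law pace works out to roughly 1.6x per year, while a million times in ten years implies roughly 4x per year, every year.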
Measure tokens per watt, not FLOPs
FLOPs is a contrived metric: necessary but not sufficient. Huang says he would rather run at low MFU (model FLOPs utilization), because that means he is over-provisioned on FLOPs, memory bandwidth, network, and capacity, so no single resource bottlenecks the work (sidestepping the bottleneck effect that Amdahl's law describes). The real unit is tokens per watt, roughly intelligence per watt. He cited a leaked memo reporting that xAI's Memphis cluster runs near 11% MFU.
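A small sketch of why the two scoreboards can disagree. It reads "tokens per watt" as tokens per second per watt, which is the same as tokens per joule; that reading is an assumption, not something the talk spells out. The `System` class and both sets of numbers are hypothetical and chosen only to show the effect.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    peak_flops: float      # peak FLOP/s the hardware could deliver
    achieved_flops: float  # FLOP/s the model's computation actually uses
    tokens_per_s: float    # output tokens generated per second
    power_w: float         # power draw in watts

    @property
    def mfu(self) -> float:
        """Model FLOPs utilization: achieved FLOP/s divided by peak FLOP/s."""
        return self.achieved_flops / self.peak_flops

    @property
    def tokens_per_joule(self) -> float:
        """Tokens per second per watt, i.e. tokens per joule."""
        return self.tokens_per_s / self.power_w


# Hypothetical numbers, not measurements of any real machine.
flops_tight = System("flops-tight", peak_flops=1e15, achieved_flops=6e14,
                     tokens_per_s=2_000, power_w=1_000)
over_provisioned = System("over-provisioned", peak_flops=1e16, achieved_flops=1e15,
                          tokens_per_s=20_000, power_w=4_000)

for s in (flops_tight, over_provisioned):
    print(f"{s.name:>16}: MFU {s.mfu:.0%}, {s.tokens_per_joule:.1f} tokens/J")
```

In this made-up comparison the over-provisioned system looks worse on MFU and better on tokens per joule, which is the trade Huang describes preferring.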
Prefill vs decode, disaggregated
Running a large model has two phases: prefill (context and attention processing) and decode (generating the output tokens). Decode is the memory-bandwidth-hungry part, and the bandwidth needed is far more than one chip can supply. So NVIDIA ganged 72 GPUs into the first rack-scale computer, Grace Blackwell NVLink 72, delivering high tokens per watt even at very low MFU by disaggregating decode from prefill. That generation was a 50x speedup in two years, where Moore's Law would have given 2x.
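One way to see why decode wants aggregate bandwidth is a back-of-the-envelope bound. This sketch assumes the common simplification that generating one token for a single sequence requires reading every model weight from memory once, so tokens per second cannot exceed bandwidth divided by weight bytes. It ignores the KV cache, batching, and interconnect overhead. The function name and all numbers are hypothetical, not NVIDIA specifications.

```python
def decode_tokens_per_s_upper_bound(
    param_count: float,
    bytes_per_param: float,
    bandwidth_bytes_per_s: float,
    num_gpus: int = 1,
) -> float:
    """Bandwidth-bound ceiling on decode speed for one sequence.

    Assumes each generated token reads all weights once and that the
    weights are sharded so the GPUs' memory bandwidths add up.
    """
    weight_bytes = param_count * bytes_per_param
    return num_gpus * bandwidth_bytes_per_s / weight_bytes


# Hypothetical model and hardware figures, for illustration only.
PARAMS = 1e12            # one trillion parameters
BYTES_PER_PARAM = 1.0    # 8-bit weights
GPU_BANDWIDTH = 8e12     # 8 TB/s of memory bandwidth per GPU

one_gpu = decode_tokens_per_s_upper_bound(PARAMS, BYTES_PER_PARAM, GPU_BANDWIDTH)
rack = decode_tokens_per_s_upper_bound(PARAMS, BYTES_PER_PARAM, GPU_BANDWIDTH, num_gpus=72)

print(f"1 GPU:   at most {one_gpu:.0f} tokens/s per sequence")
print(f"72 GPUs: at most {rack:.0f} tokens/s per sequence")
```

Under these assumptions the ceiling scales with the number of GPUs whose bandwidth can be pooled, regardless of how many FLOPs sit idle, which is consistent with high tokens per watt at low MFU.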
A roadmap keyed to the workload
Each chip generation is designed for the compute pattern of the moment. Hopper was built for pre-training. NVLink 72 / Grace Blackwell was built for inference and decode. Vera Rubin is built for agents, which need long-term memory in storage wired straight to the GPU and a low-latency CPU (Vera) for the single-threaded tool calls agents fire off. Feynman is aimed at swarms of agents with subagents of subagents.
The economics of intelligence
If computing is a million times faster, everything about computing changes, the way society would change if you could cross the country in ten minutes. Because future compute is both generated and continuous rather than initiated per use, Huang estimates we will need on the order of a thousand times more energy, and would not be shocked to be off by a couple of orders of magnitude. The levers are energy efficiency through co-design, plus market forces now strong enough to build sustainable energy without subsidies.
- The computer's mental model was stable for 64 years since the IBM System/360, and generative real-time compute breaks it.
- Co-design (chips + compilers + networking + storage optimized together) bought roughly a million times more compute over ten years versus about 100x from Moore's Law.
- The reason models train on the whole internet is that co-design made compute so fast that curating data became unnecessary.
- FLOPs is the wrong scoreboard; tokens per watt (intelligence per watt) is the real one, and low MFU can mean you are healthily over-provisioned.
- Decode, not prefill, is the bandwidth bottleneck in inference, which is why NVIDIA built the 72-GPU NVLink rack and disaggregated the two phases.
- The chip roadmap tracks the workload: Hopper for pre-training, Grace Blackwell for inference, Vera Rubin for agents, Feynman for agent swarms.
- AI compute will likely need about a thousand times more energy, making this the best-ever moment to build sustainable power because the market now pays for it.
In their words
“In the case of Nvidia and co-design, we got 1 million x over 10 years.”
“I just delivered incredibly high tokens per watt with extremely low MFU.”
“The goal of AI is not training. The goal of AI is inference.”
Terms to know
- Co-design: Optimizing algorithms, compilers, chips, networking, and storage together instead of as separate specialized fields.
- MFU (Model FLOPs Utilization): The fraction of a chip's peak FLOPs that the model's computation actually uses; Huang prefers it low, meaning over-provisioned rather than starved.
- Tokens per watt: Output tokens generated per unit of energy, Huang's proposed real measure of intelligence delivered, driven more by bandwidth than FLOPs.
- Prefill vs decode: Prefill processes the input context and attention; decode generates output tokens and is the memory-bandwidth-hungry phase.
- NVLink 72 / Grace Blackwell: NVIDIA's rack-scale computer ganging 72 GPUs so decode has enough aggregate memory bandwidth; a 50x jump in two years.
- Dennard scaling: The physics that let transistors shrink at constant power density and underpinned Moore's Law; it ran out roughly a decade ago.
Jensen Huang at Stanford CS 153: Frontier Systems