CS 153 · Lecture 4

The Discipline of Delivering Value per Gigawatt

Amin Vahdat · VP & GM, ML, Systems, and Cloud AI, Google

5 min read · Guest lecture · Free

Amin Vahdat runs Google's internal compute, the fleet of TPUs behind Gemini. He argues the whole industry is measuring the wrong thing when it counts gigawatts and flops.

The big idea

A gigawatt of AI infrastructure costs roughly $40 billion, but not all gigawatts are equal. What matters is how much capability, reliability, and value you extract from every watt and dollar, measured in something real like happy daily active users or intelligence per dollar. Getting there depends on two hard disciplines: keeping nodes reliable (at Google, node allocation under 96% is treated as a major outage) and getting system balance right, so that flops, memory bandwidth, and network are provisioned in the correct ratio instead of starving each other.

Value per gigawatt

One gigawatt of buildout is about $40 billion of infrastructure, and prices are rising toward $50 billion. Vahdat argues gigawatts and dollars-per-gigawatt are broken measures. The question is value per dollar: if you can deliver the same capability from half the capacity, you win, and you then need to build fewer gigawatts. Reframed at the product layer, the metric is happy daily active users, paying customers, or developers getting work done, not the raw capacity sitting behind them.
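The arithmetic behind "same capability from half the capacity" can be sketched in a few lines. This is a toy calculation, not a metric Vahdat defined: the $40 billion per gigawatt figure is from the lecture, while `value_per_dollar` and the `capability_units` number are hypothetical placeholders for whatever product metric you actually track.

```python
COST_PER_GW_USD = 40e9  # lecture's rough figure for one gigawatt of buildout


def value_per_dollar(capability_units: float, gigawatts: float,
                     cost_per_gw: float = COST_PER_GW_USD) -> float:
    """Capability delivered per infrastructure dollar (units are illustrative)."""
    if gigawatts <= 0:
        raise ValueError("gigawatts must be positive")
    return capability_units / (gigawatts * cost_per_gw)


# Hypothetical: two fleets deliver the same capability, one with half the capacity.
baseline = value_per_dollar(capability_units=1_000_000, gigawatts=2.0)
efficient = value_per_dollar(capability_units=1_000_000, gigawatts=1.0)
ratio = efficient / baseline
print(f"{ratio:.1f}x value per dollar")
```

Halving the capacity needed for the same output doubles value per dollar, which is the sense in which the efficient fleet "wins" regardless of how many gigawatts it reports.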

Reliability is the multiplier

A gigawatt is roughly 150,000 to 200,000 accelerators, and training runs them synchronously, so one node failing can halt the whole computation. At Google, if node allocation drops below 96% it is considered a major outage. The old web-scale playbook of loose coupling and shrugging off single failures is gone, because in a training job every node holds a specific piece of the model and is not fungible.
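A rough sketch of why reliability dominates at this scale. The 96% threshold and the 150,000-accelerator count come from the lecture; the per-node availability of 99.99% and the function names are assumptions chosen for illustration, and node failures are treated as independent, which real fleets do not guarantee.

```python
MAJOR_OUTAGE_THRESHOLD = 0.96  # from the lecture: below this is a major outage


def is_major_outage(allocated_nodes: int, total_nodes: int,
                    threshold: float = MAJOR_OUTAGE_THRESHOLD) -> bool:
    """True when the fraction of allocatable nodes falls below the threshold."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    return allocated_nodes / total_nodes < threshold


def prob_all_nodes_healthy(per_node_availability: float, num_nodes: int) -> float:
    """Chance that every node is up at once, assuming independent failures."""
    return per_node_availability ** num_nodes


# Hypothetical: each node is individually up 99.99% of the time.
p_all_up = prob_all_nodes_healthy(0.9999, 150_000)
print(f"P(all 150,000 nodes healthy) = {p_all_up:.2e}")
print(is_major_outage(allocated_nodes=95_000, total_nodes=100_000))
```

Even with very reliable individual nodes, the chance that all of them are healthy at the same instant is tiny, which is why a synchronous job needs spares and fast recovery rather than the web-era habit of ignoring single failures.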

System balance

Scaling flops is easy; building a balanced supercomputer is hard. Vahdat leans on Amdahl's law from 1967: for every unit of compute you provision, you need matching IO to feed it, or the compute starves. Today that IO is memory bandwidth (HBM) and network bandwidth. The move to mixture-of-experts and sparse models means current hardware is often out of balance, needing more memory bandwidth relative to compute, which is part of why measured utilization (MFU) is low.
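One common way to reason about balance is the roofline model, which is not named in the lecture but captures the same idea: attainable throughput is capped by either peak compute or by memory bandwidth times the work done per byte moved. The chip numbers below are hypothetical, not TPU specifications.

```python
def attainable_flops(peak_flops: float, mem_bandwidth_bytes_per_s: float,
                     arithmetic_intensity: float) -> float:
    """Roofline bound: min of compute peak and bandwidth * FLOPs-per-byte."""
    return min(peak_flops, mem_bandwidth_bytes_per_s * arithmetic_intensity)


def balance_point(peak_flops: float, mem_bandwidth_bytes_per_s: float) -> float:
    """FLOPs per byte a workload needs before compute, not memory, is the limit."""
    return peak_flops / mem_bandwidth_bytes_per_s


# Hypothetical accelerator: 1 PFLOP/s peak compute, 2 TB/s HBM bandwidth.
PEAK = 1e15
HBM_BW = 2e12

ridge = balance_point(PEAK, HBM_BW)
# A memory-hungry workload doing only 100 FLOPs per byte fetched.
sparse_fraction = attainable_flops(PEAK, HBM_BW, 100.0) / PEAK
# A dense workload doing 1,000 FLOPs per byte fetched.
dense_fraction = attainable_flops(PEAK, HBM_BW, 1000.0) / PEAK
print(f"balance point: {ridge:.0f} FLOPs/byte")
print(f"low-intensity workload reaches {sparse_fraction:.0%} of peak")
print(f"high-intensity workload reaches {dense_fraction:.0%} of peak")
```

A workload below the balance point leaves compute idle no matter how many flops the chip advertises, which is the mechanism behind low utilization on hardware that is short on memory bandwidth.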

Access over reliability

Five nines of availability means about five minutes of downtime a year, and delivering it means running redundant 2N power feeds where half your capacity sits idle. Historically, enterprises demanded five nines. Now frontier labs training models will happily trade a few days of downtime per year for double the usable capacity, because training is about throughput, not uptime. Vahdat calls this a recent and genuinely new shift in what customers ask for.
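The trade can be checked with simple arithmetic. This is a minimal sketch: the function names are hypothetical, "a few days" is taken as three days for illustration, and the 2N model assumes exactly half of the provisioned power capacity is held in reserve.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year


def downtime_seconds_per_year(availability: float) -> float:
    """Expected yearly downtime for a given availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * SECONDS_PER_YEAR


def availability_for_downtime_days(days: float) -> float:
    """Availability fraction implied by a number of down days per year."""
    return 1.0 - days / 365.0


def usable_fraction(redundancy: str) -> float:
    """Share of provisioned power doing useful work (toy model of N vs 2N)."""
    fractions = {"2N": 0.5, "N": 1.0}
    return fractions[redundancy]


five_nines = downtime_seconds_per_year(0.99999)
three_days = availability_for_downtime_days(3)
capacity_gain = usable_fraction("N") / usable_fraction("2N")
print(f"five nines: {five_nines / 60:.1f} minutes of downtime per year")
print(f"3 days down per year: {three_days:.2%} availability")
print(f"usable capacity gain from dropping 2N: {capacity_gain:.0f}x")
```

Under these assumptions, giving up roughly three nines of availability buys twice the usable capacity, which is an easy trade for a throughput-bound training job that checkpoints and restarts.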

Optical circuit switches

Google's TPU differentiator is availability. Racks of 64 TPUs are wired into a 3D torus, and optical circuit switches (chips with 136 tiny steerable mirrors) act like a robot that unplugs and replugs fiber under software control. If a rack fails, a spare rack drops into the exact same topology position in seconds, keeping the torus whole. The same switching also lets Borg point bandwidth at whichever storage cluster a five-hour job needs, so the network does not have to be built out everywhere.
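The spare-rack swap can be pictured as a remapping between logical torus positions and physical racks. The toy model below is an illustration only, not Google's control software: the `TorusFabric` class, the rack names, and the 3x3x3 layout are all hypothetical, and the optical switch is reduced to a dictionary update.

```python
from itertools import product


class TorusFabric:
    """Toy model: logical 3D torus slots mapped onto physical racks."""

    def __init__(self, dims, rack_ids, spare_ids):
        slots = list(product(*(range(d) for d in dims)))
        if len(rack_ids) != len(slots):
            raise ValueError("need exactly one rack per torus slot")
        self.dims = dims
        self.slot_to_rack = dict(zip(slots, rack_ids))
        self.spares = list(spare_ids)

    def neighbors(self, slot):
        """Wraparound neighbors of a slot along each of the three axes."""
        result = []
        for axis, size in enumerate(self.dims):
            for step in (-1, 1):
                coords = list(slot)
                coords[axis] = (coords[axis] + step) % size
                result.append(tuple(coords))
        return result

    def replace_failed(self, failed_rack):
        """Point the failed rack's slot at a spare; the logical shape is unchanged."""
        if not self.spares:
            raise RuntimeError("no spare racks available")
        for slot, rack in self.slot_to_rack.items():
            if rack == failed_rack:
                spare = self.spares.pop(0)
                self.slot_to_rack[slot] = spare  # stands in for re-steering mirrors
                return slot, spare
        raise KeyError(f"{failed_rack} is not in the torus")


fabric = TorusFabric((3, 3, 3), [f"rack-{i}" for i in range(27)], ["spare-0"])
slot, spare = fabric.replace_failed("rack-13")
print(f"{spare} now occupies slot {slot}")
```

The point of the design is that jobs address logical slots, so the neighbor relationships a training run depends on survive the swap; only the slot-to-rack mapping changes.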

Energy and lead times

The single biggest bottleneck Vahdat cannot fix with money is energy. Lead time for a net-new gigawatt is two to three years: land, permitting, grading, and a utility contract that now demands paying for the power 24/7 for 20 years because the grid has no slack. Near-term, the proven path is solar, wind, and batteries; data centers in space are five to ten years out and carry risk. He wants each data center to be an uplift for its local grid and community, including water-sparing designs and giving power back to the grid on peak-demand days.

Key takeaways
  • One gigawatt of AI buildout costs roughly $40 billion, and the number is climbing toward $50 billion.
  • Stop measuring gigawatts and flops; measure value per dollar, or concretely, happy daily active users and intelligence per dollar.
  • A gigawatt is around 150,000 to 200,000 accelerators, and in synchronous training one dead node can stop everything.
  • Google treats node allocation below 96% as a major outage, because reliability is what turns capacity into delivered value.
  • Amdahl's law from 1967 still holds: provision matching IO (memory and network bandwidth) for your compute or it starves.
  • Frontier labs now trade uptime for capacity, a reversal of the old enterprise demand for five nines.
  • Energy is the bottleneck Vahdat can least solve; a new gigawatt has a two-to-three-year lead time.
  • Specialization is winning: Google's 8th-gen TPUs split into separate inference (8i) and training (8T) chips for the first time.

In their words

The measure isn't how much money you spent per gigawatt, it's actually how much value you deliver per dollar.
Amin Vahdat
Scaling flops is easy. Building a coordinated supercomputer that scales out to 10,000, 100,000 TPUs that has the right balance point, super hard.
Amin Vahdat
There's no such thing as winners and losers in the real world. They're just people who get done and who don't.
Amin Vahdat

Terms to know

Gigawatt (of infra)
The unit the industry uses to size an AI datacenter buildout; about $40 billion and 150,000-200,000 accelerators.
MFU / goodput
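Model FLOPs Utilization is the share of the hardware's peak flops that a model actually achieves; goodput is the useful work completed. Together they show how much of the hardware you turn into real work rather than idle waste.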
System balance
Provisioning flops, memory bandwidth (HBM), and network in the right ratio so no part starves the others.
Amdahl's law
1967 rule that a parallel system needs matching IO per unit of compute, or the compute cannot be fed and sits idle.
Optical circuit switch
A chip of steerable mirrors that reroutes fiber in software, letting a spare rack instantly replace a failed one in the topology.
Five nines
99.999% availability, about five minutes of downtime a year, which requires costly 2N redundant power.

Watch the full lecture

Amin Vahdat at Stanford CS 153: Frontier Systems
