Mati Staniszewski walks through building ElevenLabs from a Discord text-to-speech bot to a company past $430M in revenue in 36 months, and lays out where voice AI goes next.
ElevenLabs won by staying obsessively focused on one modality and one hard problem: making generated speech sound genuinely human. Rather than build every part of the dubbing pipeline, they picked the single component that mattered most (text-to-speech generation) and pushed it past the state of the art, then expanded modality by modality. Voice becomes a core interface because it carries emotion, tone, and context that text strips away, and the systems that capture that will define how people and businesses talk to machines.
The Polish dubbing problem
Staniszewski and co-founder Piotr, both Polish and ex-Google/Palantir, were bothered that foreign films in Poland are narrated by a single flat monotone voice reading every character. Their founding bet was that the future lets anyone access any content in any language with real tonality and emotion. That is the AI dubbing problem they set out to solve.
The cascaded pipeline
Dubbing needs three chained models: transcription (which also has to identify who spoke and strip background noise), translation via an LLM, and text-to-speech to recast the audio in the new language while borrowing the original performance. Each component had to work for the whole to work. In 2022, before the GPT moment, LLM translation was still poor, so the full pipeline broke.
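To make the chaining concrete, here is a minimal sketch of a three-stage pipeline. Every name in it (`Segment`, `transcribe`, `translate`, `synthesize`, `dub`) is hypothetical, and each stage is a stub standing in for a real model; it is not ElevenLabs' implementation. It only illustrates why one weak stage breaks the whole.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str  # who spoke, as identified by the transcription stage
    text: str     # what was said


def transcribe(audio: bytes) -> list[Segment]:
    # Hypothetical stub: a real stage would run speech-to-text with
    # speaker identification and background-noise removal.
    return [Segment(speaker="A", text="dobranoc")]


def translate(segment: Segment, target_lang: str) -> Segment:
    # Hypothetical stub: a real stage would call an LLM. The tiny lookup
    # table raises KeyError for anything it cannot translate.
    lookup = {("dobranoc", "en"): "good night"}
    return Segment(segment.speaker, lookup[(segment.text, target_lang)])


def synthesize(segment: Segment) -> bytes:
    # Hypothetical stub: a real stage would generate audio that keeps
    # the original speaker's voice and delivery.
    return f"[{segment.speaker}] {segment.text}".encode("utf-8")


def dub(audio: bytes, target_lang: str) -> list[bytes]:
    # Each stage consumes the previous stage's output, so a failure
    # anywhere in the chain propagates to the final result.
    return [synthesize(translate(s, target_lang)) for s in transcribe(audio)]
```

With these stubs, `dub(b"", "en")` returns one synthesized segment, while an unsupported target such as `"de"` fails in the translation stage and no audio is produced at all.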
Innovate on the last mile only
Talking to studios and creators revealed simpler needs first, like voice-over corrections and reading scripts. So they narrowed to the common denominator across every use case: text-to-speech generation. They chose not to innovate on transcription or the LLM and instead pushed the frontier on making speech sound natural, replicate a voice's characteristics, and follow the delivery a sentence's context demands.
Tiny team, tiny compute
The first ElevenLabs checkpoint was inspired by Tortoise TTS, an open-source model James Betker built at Google on nights and weekends that sounded human on short fragments but was slow and unstable on longer text. ElevenLabs' early models were hundreds of millions of parameters, trained on free credits from programs like NVIDIA Inception, costing tens of thousands of dollars. They skipped a $6,000 patent as too expensive at that stage.
Year by year to real time
2022 was the first natural-speech breakthrough. 2023 brought voice cloning, a voice marketplace, and creator tooling. 2024 delivered AI localization, seen when Javier Milei's UN speech and Lex Fridman's interviews with world leaders were dubbed while keeping each speaker's iconic delivery. 2025 finally made models fast enough for real-time interactive voice agents.
Cascaded versus fused
The open question for voice agents is whether to keep three separate models (cascaded) or train one that generates speech tokens directly (fused). Staniszewski's read: cascaded wins on reliability, intelligence, and tool-calling for enterprise use, so it is right for the next few years. Fused wins on latency (roughly 300ms) and is better for companion-style uses where reliability matters less. Most customers will blend both depending on the moment.
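The trade-off above can be sketched as a simple routing rule. The function name, its two boolean inputs, and the rule itself are illustrative assumptions, not a policy described in the talk; a real deployment would weigh more signals.

```python
from enum import Enum


class Architecture(Enum):
    CASCADED = "cascaded"  # separate speech-to-text, LLM, and text-to-speech models
    FUSED = "fused"        # one model that generates speech tokens directly


def choose_architecture(needs_tool_calling: bool, high_stakes: bool) -> Architecture:
    # Illustrative rule only: prefer cascaded when reliability or
    # tool-calling matters, otherwise take the lower-latency fused model.
    if needs_tool_calling or high_stakes:
        return Architecture.CASCADED
    return Architecture.FUSED
```

Under this assumed rule, an enterprise support agent that calls tools routes to the cascaded path, and a casual companion conversation routes to the fused one.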
The business engine
ElevenLabs crossed 2025 at $330M in revenue and added over $100M ARR in a quarter to pass $430M, with about 450 people in teams of under 10 that own their decisions and move fast. Revenue scales predictably with deployment, not just compute. On pricing, Staniszewski's rule: start from the value you deliver, never the cost, and aim to capture about one tenth of that value.
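As a worked example of the pricing rule, the sketch below computes a price from the value delivered. The function name, the `capture_fraction` parameter, and the sample figure are hypothetical; only the one-tenth target comes from the talk.

```python
def value_based_price(value_delivered: float, capture_fraction: float = 0.10) -> float:
    # Start from the value the customer receives, never from the cost
    # of serving them, and capture a fixed fraction of that value.
    if value_delivered < 0:
        raise ValueError("value_delivered must be non-negative")
    return value_delivered * capture_fraction
```

For instance, if a customer is assumed to gain $50,000 a year from the product, the rule suggests a price of about $5,000 a year, regardless of what it costs to serve them.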
- Pick one modality and one hard problem, push it past the state of the art, then expand. ElevenLabs did voice generation first, then transcription, dubbing, agents, and music.
- Being close to users via Discord and PLG taught them which of the three pipeline components (transcription, translation, TTS) to actually fix first.
- Text-to-speech was the common denominator across every customer need, so improving it unlocked the most use cases at once.
- The two 2022 breakthroughs: making delivery follow a sentence's emotional context, and letting the model infer voice characteristics instead of hardcoding gender, accent, and age.
- Cascaded architectures are best when reliability, intelligence, and tool-calling matter; fused models win on latency for lower-stakes interactions.
- Emotionality is now becoming controllable in both approaches, which required a large data-labeling effort to teach models what happy, sad, or stressed delivery sounds like.
- Voice should not be used for authentication anymore, because voices are now easy to replicate; safety comes from watermarking and traceability instead.
- Price from the value delivered, target capturing one tenth of it, and never price up from cost.
In their words
“You can go further together.”
“Never start from the cost, start from the value and work backwards from there.”
“We crossed 2025 at 330 million in revenue, and this quarter was our biggest quarter.”
Terms to know
- Cascaded architecture: a voice system that chains three separate models (speech-to-text, an LLM, then text-to-speech).
- Fused (omni) model: a single model trained to go straight from incoming audio to generated speech tokens without a text step, favoring speed over reliability.
- Text-to-speech (TTS): turning written text into spoken audio; the generation step ElevenLabs chose to specialize in.
- PLG: product-led growth, letting creators and developers adopt the product directly rather than through a sales motion.
- Tortoise TTS: an open-source speech model by James Betker that first sounded human on short fragments but was slow and unstable on long text.
- Middle-to-middle: using AI as an iterative step inside a creative workflow rather than an end-to-end prompt-to-output button, to avoid AI slop.
Mati Staniszewski at Stanford CS 153: Frontier Systems