How an LLM is trained: the internet in a box

The takeaway

Pretraining takes a filtered copy of the public internet, chops it into tokens, and trains a giant neural network to predict the next token over and over. What comes out is a base model: a statistical simulator of internet documents that has absorbed a rough, lossy picture of the world, but does not yet answer your questions.

Download the internet

The first step in building a model like ChatGPT is to collect an enormous pile of text. Andrej Karpathy points to FineWeb, a public dataset from Hugging Face that is representative of what the big labs build internally. OpenAI, Anthropic, and Google each have their own version. The goal is the same for all of them: a huge quantity of high-quality documents, covering as wide a range of topics as possible, so the model ends up knowing about many things.

Most of that text starts life at Common Crawl, an organization that has been scraping the web since 2007. Its crawlers begin with a few seed pages, follow the links they find, and keep going. By 2024 the crawl had indexed about 2.7 billion web pages. That raw haul is the starting point, and it is messy.

One number is worth sitting with. After all the collecting and filtering, FineWeb is only about 44 terabytes. That fits on a single large hard drive. The internet feels infinite, but once you keep only the text and throw away the junk, the useful part is surprisingly small.

The internet feels infinite. Once you keep only the text and throw away the junk, it fits on a hard drive.

Filter out the junk

Raw crawl data is not something you would want to train on directly, so it passes through several filtering stages. Each one is a design decision, and different companies make different calls.

The steps are practical and unglamorous. URL filtering drops entire domains using blocklists: malware sites, spam, pure marketing pages, racist sites, adult sites. Text extraction pulls the actual words out of the raw HTML and discards the navigation, menus, and styling code. Language filtering guesses the language of each page and keeps only what the lab wants. FineWeb, for instance, keeps a page only when its language classifier scores it as English with a confidence of at least about 65 percent, which is why a model trained on it will be strong in English and weaker elsewhere. After that come deduplication and the removal of personally identifiable information such as addresses and social security numbers.

What survives is plain text. Concatenate the first couple hundred cleaned pages and you get a giant tapestry: an article about tornadoes, an odd medical note about your adrenal glands, and so on. This is the raw material. The next step is teaching a neural network to mimic how that text flows.

URL filtering: drop malware, spam, marketing, racist, and adult domains.
Text extraction: keep the words, discard HTML markup and navigation.
Language filtering: a lab choosing 65 percent English is choosing weaker Spanish later.
Deduplication and PII removal: cut repeats, strip addresses and other private data.
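The four stages above can be sketched as one small pipeline. This is a toy under loud assumptions, not FineWeb's actual code: the blocklist domains, the common-word language score (and its 0.2 threshold, which is unrelated to the 65 percent classifier threshold), the exact-match deduplication, and the single social security number pattern are all hypothetical stand-ins for much heavier production tools.

```python
import hashlib
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical blocklist; real pipelines use large curated lists.
BLOCKED_DOMAINS = {"spam.example", "malware.example"}

# Crude stand-in for a trained language classifier: a few very common English words.
COMMON_ENGLISH = {"the", "and", "of", "to", "a", "in", "is", "that", "it", "for"}

# Illustrative pattern for US social security numbers only.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script, style, and nav content."""

    SKIPPED_TAGS = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Text extraction: keep the words, drop markup and navigation."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def english_score(text):
    """Fraction of words found in COMMON_ENGLISH (a rough proxy, not a real classifier)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in COMMON_ENGLISH for w in words) / len(words)


def clean_pages(pages, min_english=0.2):
    """pages: iterable of (url, html). Yields cleaned text for pages that pass every filter."""
    seen = set()
    for url, html in pages:
        # 1. URL filtering
        if urlparse(url).hostname in BLOCKED_DOMAINS:
            continue
        # 2. Text extraction
        text = extract_text(html)
        # 3. Language filtering
        if english_score(text) < min_english:
            continue
        # 4a. Deduplication (exact matches only in this sketch)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # 4b. PII removal
        yield SSN_PATTERN.sub("[REDACTED]", text)
```

Calling list(clean_pages(pages)) on a handful of (url, html) pairs returns only the pages that survive all four stages. The order of the stages matters for cost: the cheap URL check runs first so that expensive work is never spent on a page that was going to be dropped anyway.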

Why text becomes tokens

A neural network cannot read a paragraph the way you do. It expects a one-dimensional sequence drawn from a fixed, finite set of symbols. So before training, the text has to be converted into that form. This conversion is called tokenization, and the symbols are called tokens.

You could go to the extreme and represent everything as raw bits, just zeros and ones. That gives you only two symbols but makes the sequence enormously long, and sequence length is a precious, expensive resource. The trade Karpathy describes is between vocabulary size and sequence length: use more distinct symbols and you get shorter sequences. Grouping bits into bytes gives you 256 possible symbols. A method called byte-pair encoding then goes further, repeatedly finding the most common pair of symbols and minting a single new symbol for it. Run that enough times and you land on a vocabulary of roughly 100,000 tokens. GPT-4 uses 100,277.
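The merge loop described above is short enough to write out. The following is a minimal sketch of byte-pair encoding, assuming UTF-8 bytes as the starting symbols and new symbol IDs that begin at 256; the function names are made up for illustration, and production tokenizers add details (pre-splitting text, special tokens) that are left out here.

```python
from collections import Counter


def most_common_pair(ids):
    """Return the most frequent adjacent pair of symbols, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Start from UTF-8 bytes (256 symbols) and mint one new symbol per merge."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_common_pair(ids)
        if pair is None:
            break
        new_id = 256 + step
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", num_merges=3)
```

On the toy string "aaabdaaabac", three merges shrink the sequence from 11 symbols to 5 while growing the vocabulary from 256 to 259. That is the trade in miniature: each new symbol costs a vocabulary slot and buys a shorter sequence.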

A token is usually a chunk of a word, not a single letter and not a whole word. Feed 'hello world' into GPT-4's tokenizer and it becomes two tokens. Add a capital letter or an extra space and the split changes. This matters more than it looks. The model never sees letters directly; it sees these chunks. That is a big reason base and chat models can stumble on spelling and counting tasks that seem trivial to a human. FineWeb's 44 terabytes of text works out to about 15 trillion tokens, and that sequence of token IDs is what actually gets trained on.

The model never sees letters. It sees chunks. That single fact explains a lot of its odd failures.

Predict the next token

With the internet turned into a long sequence of tokens, training can begin. The task is almost absurdly simple to state. Take a window of tokens from the data, feed it in as context, and ask the network to predict the token that comes next. The network outputs a probability for every one of the roughly 100,000 possible tokens. Because you took the window from real data, you already know the right answer, so you can nudge the network to raise the probability of the correct token and lower the rest.
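To make the nudge concrete, here is a sketch of a single prediction and a single update on a toy vocabulary of four tokens. It simplifies in one important way, flagged as an assumption: a real network adjusts its weights, whereas this example adjusts the raw scores (logits) directly. The loss used is cross-entropy, the standard choice for this task, and the learning rate of 0.5 is arbitrary.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Loss for one prediction: negative log probability of the correct token."""
    return -math.log(probs[target])


def nudge(logits, target, learning_rate=0.5):
    """One gradient step on the logits; the gradient of the loss is probs - one_hot(target)."""
    probs = softmax(logits)
    return [
        logit - learning_rate * (p - (1.0 if i == target else 0.0))
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0, 0.0, 0.0]   # toy 4-token vocabulary, no preference yet
target = 2                      # the token that actually came next in the data

before = softmax(logits)
after = softmax(nudge(logits, target))
loss_before = cross_entropy(before, target)
loss_after = cross_entropy(after, target)
```

After the step, the probability of the correct token has gone up, every other probability has gone down, and the loss is lower. Training is this same move repeated an enormous number of times.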

That nudge happens through the parameters, or weights, of the network. Karpathy compares them to knobs on a DJ set: billions of them, set randomly at the start, adjusted a little with every update. A modern network is a giant mathematical expression, but the individual operations inside it are ordinary things like multiplication and addition. The specific architecture behind today's models is called the Transformer. You do not need its internals to grasp the point: it turns an input sequence into a next-token prediction, and training means finding a setting of the knobs that makes those predictions match the statistics of real text.

Do this across the whole dataset, in parallel, on billions of windows, and a single number called the loss slowly falls. Lower loss means better predictions. Watch a fresh model train and its output starts as pure gibberish; a little later it produces something with local coherence; given enough time it writes fairly fluent English. Nobody wrote rules for grammar or facts. The rules got absorbed from the data, one next-token guess at a time.
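The falling loss can be watched at toy scale. The sketch below is not a Transformer: it assumes the simplest possible model, a table of scores with one row per previous token, so the context window is a single token. The corpus, the learning rate, and the number of passes are invented for illustration.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(tokens, vocab_size, epochs=50, learning_rate=0.5, seed=0):
    """Train a table of logits, one row per previous token; return it with the mean loss per epoch."""
    rng = random.Random(seed)
    # The "knobs": set randomly at the start.
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(vocab_size)] for _ in range(vocab_size)]
    pairs = list(zip(tokens, tokens[1:]))   # (context token, token that came next)
    history = []
    for _ in range(epochs):
        total_loss = 0.0
        for context, target in pairs:
            probs = softmax(weights[context])
            total_loss += -math.log(probs[target])
            # Nudge this row: raise the correct token's score, lower the rest.
            for i in range(vocab_size):
                grad = probs[i] - (1.0 if i == target else 0.0)
                weights[context][i] -= learning_rate * grad
        history.append(total_loss / len(pairs))
    return weights, history


text = "the cat sat on the mat the cat sat on the mat"
vocab = sorted(set(text.split()))
token_ids = [vocab.index(word) for word in text.split()]
weights, history = train_bigram(token_ids, len(vocab))
```

Printing history shows the mean loss starting near the value for a uniform guess over five tokens and ending far lower. It never reaches zero, because "the" is followed by "cat" half the time and "mat" the other half, and no setting of the knobs can predict both with certainty.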

The scale and the bill

This is where the money goes. The training runs on GPUs, chips built for the massively parallel math these networks need. Karpathy rents an 8x H100 node from a cloud provider at a few dollars per GPU per hour. Stack many of these into a data center and you have the machines every large tech company is fighting to buy. That demand is a big part of why Nvidia's value climbed into the trillions.

The trend over time is steep. OpenAI's GPT-2, from 2019, had about 1.5 billion parameters and trained on roughly 100 billion tokens. It cost an estimated 40,000 dollars back then; today a comparable run costs on the order of a hundred dollars, thanks to better data, faster hardware, and better software. Meta's Llama 3 is a different scale entirely: its largest base model has 405 billion parameters and trained on 15 trillion tokens.

For a frontier model, pretraining can run for something like three months across thousands of GPUs and cost many millions of dollars. This is by far the most expensive stage. Everything that comes afterward, the work that turns the model into a helpful assistant, is comparatively cheap.

GPUs do the work: many parallel processors chewing through matrix math.
GPT-2 (2019): about 1.5 billion parameters, roughly 100 billion tokens.
Llama 3: a 405 billion parameter base model trained on 15 trillion tokens.
Frontier pretraining: months of time, thousands of GPUs, millions of dollars.
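The bill is simple arithmetic: GPUs times hours times the hourly rate. The sketch below plugs in assumed numbers purely for illustration. The article says only "a few dollars" per GPU per hour and "thousands of GPUs", so the 3 dollar rate, the 4,000 GPU cluster, and the 90 day run are hypothetical, and real frontier budgets include much more than rental time.

```python
def rental_cost(num_gpus, hours, usd_per_gpu_hour):
    """Cost of renting GPUs: every GPU is billed for every hour of the run."""
    return num_gpus * hours * usd_per_gpu_hour


# Assumed figures for illustration only, not quoted prices.
ASSUMED_RATE = 3.0   # dollars per GPU per hour

one_day_on_one_node = rental_cost(num_gpus=8, hours=24, usd_per_gpu_hour=ASSUMED_RATE)
three_months_on_cluster = rental_cost(num_gpus=4000, hours=90 * 24, usd_per_gpu_hour=ASSUMED_RATE)
```

Under these assumptions, one day on a single 8-GPU node comes to 576 dollars, and 4,000 GPUs for 90 days comes to about 26 million dollars. The gap between those two numbers is the gap between a hobby-scale run and a frontier one.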

What a base model actually is

When pretraining finishes, you have a base model. It is not an assistant. Karpathy calls it an internet document simulator, or more bluntly a very expensive, glorified autocomplete. Ask a base model 'what is 2+2' and it will not reply 'four, anything else?' It treats your text as the start of some web page and continues it however the statistics suggest, wandering off into philosophy or a fresh list of questions. Run the same prompt twice and you get different answers, because the model samples from a probability distribution each time.
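The reason the same prompt gives different continuations is the sampling step, which can be sketched in a few lines. The probabilities below are invented for illustration and do not come from any real model; the point is only that the next token is drawn at random in proportion to its probability rather than chosen as the single most likely one.

```python
import random


def sample_next_token(distribution, rng):
    """Pick one token at random, weighted by the model's predicted probabilities."""
    tokens = list(distribution)
    weights = [distribution[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token probabilities after the prompt "what is 2+2".
next_token_probs = {"?": 0.35, " and": 0.25, " in": 0.2, " 4": 0.15, " four": 0.05}

rng = random.Random()   # unseeded, so each run can differ
samples = [sample_next_token(next_token_probs, rng) for _ in range(5)]
```

Run it twice and the five samples will usually differ. Passing a seeded generator such as random.Random(1) makes the draw repeatable, which is handy for debugging but is not how a chat window normally behaves.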

It is still remarkable, because predicting the next token forces the network to learn a great deal about the world and store it in its weights. Karpathy describes the 405 billion parameters of Llama 3's largest base model as a compression of the internet, like a zip file, except lossy. You are left with a gestalt, a rough impression rather than an exact copy. Prompt it with 'my top 10 landmarks in Paris' and it will happily produce a list, but the details are a vague, probabilistic recollection. Facts that appeared often on the web are more likely to be remembered correctly; rare facts you should not trust.

Two demos sharpen the picture. Paste the first line of the Wikipedia article on zebras and the base model recites the entry almost word for word, because high-quality sources like Wikipedia get sampled more often during training and the model has effectively memorized them. This is called regurgitation. Now paste a prompt about the 2024 election, which happened after the training data was collected, and the model simply invents a plausible outcome, a different one each time. It is guessing from what it knows, and Karpathy names this for what it is: hallucination. Same mechanism, next-token prediction, in both cases.

A base model can still be useful if you are clever with prompts. Give it ten English-to-Korean word pairs and then an eleventh English word, and it continues the pattern correctly. This is in-context learning: the model picks up the rule from the examples in front of it. You can even fake an assistant by writing a prompt shaped like a transcript of a helpful AI talking to a human, and the base model will play along, though it tends to keep going and invent the human's next question too. These are tricks, not the real thing. To get a model that answers reliably when you simply ask it a question, you need the next stage, post-training, where the internet-document dataset is set aside and the model learns from curated conversations. That is the subject of the next article. For now, hold onto this: pretraining gives you a statistical echo of the internet, rich with knowledge but shaped like a document, not a helper.

A base model is a lossy zip of the internet. You can unpack a rough impression of the world, not the world itself.

Watch the full 3.5-hour video
