Inside the model: predicting the next token

The takeaway

An LLM is a fixed mathematical function with billions of numbers, tuned during training and then frozen, that only ever does one thing: read the tokens so far and output a probability for every possible next token. It has no memory, no tools, and no internet inside it, and each step gets a fixed, small budget of computation. Because each token is sampled from those probabilities, the same prompt can give different answers. Because its knowledge is compressed into the parameters, it recalls facts vaguely rather than exactly. And because the per-token budget is fixed, it works better when it spreads its thinking across many tokens instead of jumping to an answer.

What the network actually is

Strip away the chat interface and a large language model is one thing: a giant mathematical function. It takes a sequence of tokens as input, mixes them together with a fixed set of internal numbers, and produces an output. Those internal numbers are called parameters, or weights. The largest Llama 3.1 model has 405 billion of them. GPT-style models today run into the hundreds of billions or more.

Karpathy compares the parameters to knobs on a DJ set. Training is the process of finding a setting of all those knobs so that the model's predictions line up with the patterns in its training data. Once training finishes, the knobs are frozen. When you chat with a model, you are talking to a version whose parameters were locked months earlier and have not moved since.

It helps to be clear about what is not inside this function. There is no memory that carries over between your messages beyond the text on screen. There is no database of facts it looks things up in. There is no live internet connection humming away inside the weights. The whole thing is what Karpathy calls a stateless expression: the same input runs through the same fixed math every time, with no side effects and nothing remembered.

It's a fixed mathematical expression from input to output with no memory. It's just stateless.
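
To make "stateless" concrete, here is a minimal sketch in Python. Everything in it is a made-up toy: the four-word vocabulary, the `FROZEN_WEIGHTS` table, and the `next_token_probs` function are illustrative assumptions, and a real model computes its scores from the whole sequence using billions of parameters. The point is only that a fixed function with frozen numbers returns the same probabilities for the same input, every time.

```python
import math

# Hypothetical toy "model": a frozen table of scores, one row per last token.
# A real network derives these scores from billions of parameters.
VOCAB = ["the", "cat", "sat", "."]
FROZEN_WEIGHTS = {
    "the": [0.1, 2.0, 0.5, -1.0],
    "cat": [0.2, -1.0, 2.5, 0.3],
    "sat": [1.5, 0.0, -0.5, 1.0],
    ".": [2.0, 0.5, 0.0, -2.0],
}


def next_token_probs(tokens):
    """Pure function: tokens in, one probability per vocabulary entry out."""
    # Toy simplification: only the last token is used to look up scores.
    scores = FROZEN_WEIGHTS[tokens[-1]]
    # Softmax turns raw scores into probabilities that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


first = next_token_probs(["the", "cat"])
second = next_token_probs(["the", "cat"])

# Nothing was remembered or changed between the two calls.
print(first == second)
for token, p in zip(VOCAB, first):
    print(f"{token!r}: {p:.3f}")
```

Any variation between two runs of a real chat model therefore has to come from somewhere other than the function itself, which is where sampling comes in.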

One token at a time

The model's only real skill is predicting what comes next. You feed in the tokens so far, and the network outputs a number for every possible token in its vocabulary, roughly 100,000 of them for a GPT-4-class model. Each number is that token's probability of coming next. Most are tiny. A few are meaningfully large.

Generation, which the field calls inference, is a loop built on top of that single skill. Take the probabilities, sample one token from them, append it to the sequence, then feed the longer sequence back in and repeat. Karpathy describes the sampling step as flipping a biased coin: tokens the model rates as likely get picked more often, but a less likely token can still come up. The model writes one word-piece, reads back everything including what it just wrote, and predicts again.

Nothing about this loop understands your whole request up front and then executes a plan. The model commits to token one before it has any idea what token fifty will be. Everything it produces, it produces by extending the sequence one small step at a time.

Input: the tokens so far.
Output: a probability for every possible next token.
Sample one token from that distribution.
Append it, feed the longer sequence back in, repeat.
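
The four steps above can be sketched as a short loop. This is a toy under stated assumptions, not how any real model is implemented: `TOY_MODEL` is a hypothetical lookup table that only looks at the last token, whereas a real network conditions on the entire sequence, and the `<start>` and `<end>` markers are invented for the example.

```python
import random

# Hypothetical stand-in for the network: last token -> probability of each
# possible next token. A real model reads the whole sequence, not one token.
TOY_MODEL = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.4, "ran": 0.6},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}


def generate(prompt, max_new_tokens=10, seed=None):
    """Extend the prompt one sampled token at a time."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Input: the tokens so far. Output: a probability per next token.
        probs = TOY_MODEL[tokens[-1]]
        candidates = list(probs.keys())
        weights = list(probs.values())
        # Sample one token from that distribution (the biased coin flip).
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<end>":
            break
        # Append it, then loop so the longer sequence is fed back in.
        tokens.append(next_token)
    return tokens


print(generate(["<start>"], seed=0))
print(generate(["<start>"], seed=1))
```

Passing a `seed` only makes the toy repeatable for testing. Leave it as `None` and repeated calls can take different paths through the same table.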

Why the same prompt gives different answers

Because generation samples from a probability distribution, it is stochastic. Karpathy shows this directly with a base model: he types the same prompt twice and gets two different continuations, because each run draws different tokens from the same distribution and then wanders off in a different direction. One early choice nudges the next, and within a few tokens the two answers have diverged completely.

A setting called temperature controls how much randomness enters the coin flip. Lower temperature makes the model lean hard toward its highest-probability tokens, so answers come out more repetitive and predictable. Higher temperature flattens the odds, so lower-probability tokens get picked more often and the output gets more varied and more surprising.

This is worth internalizing as a user. If a model gives you a wrong or oddly-worded answer, running the same prompt again can produce a genuinely different result. It is not the model changing its mind. It is the same fixed function, sampled again, landing on a different path.
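
As a rough illustration of what temperature does to the odds, the sketch below uses the common convention of dividing the model's raw scores by the temperature before converting them to probabilities. The text above does not say how any particular model implements it, so treat that convention, the function name `softmax_with_temperature`, and the three example scores as assumptions.

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    # Subtracting the maximum keeps math.exp from overflowing.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three candidate next tokens.
scores = [2.0, 1.0, 0.1]

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At the lower setting the top-scoring token takes a larger share of the probability. At the higher setting the three shares move closer together, so the less likely tokens get sampled more often.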

A lossy zip of the internet

During training the model saw something like 15 trillion tokens of text. It cannot store all of that. What it stores instead is a compressed version, squeezed into its parameters. Karpathy calls the weights a kind of zip file of the internet, but a lossy one. Like a heavily compressed photo, the general shape survives and the fine detail is gone.

That is why the knowledge inside a base model is a vague recollection rather than exact recall. Ask it for the top landmarks in Paris and it will produce a plausible list, but you cannot fully trust every detail, because none of it is stored explicitly. It is all reconstructed from statistical traces. Things that appeared often in the training text, Karpathy notes, are remembered far better than things that appeared rarely.

There is a sharp exception. When a document shows up many times in training, such as a well-known Wikipedia article, the model can memorize it and recite long stretches word for word. Karpathy calls this regurgitation, and it is usually something builders try to avoid. Most of the time the model is not reciting. It is producing a remix that has the same statistical flavor as its training data without matching any single document.

This knowledge is not precise and exact. This knowledge is vague and probabilistic and statistical.

Weights are a compressed, lossy version of the training text.
Common facts survive better than rare ones.
Recall is reconstruction, not lookup, so details can be wrong.
Heavily repeated documents can be memorized and recited verbatim.

A fixed budget per token

Here is the limit Karpathy stresses most, and the one that changes how you should prompt. Every time the model produces a single token, the input runs through the same fixed sequence of steps in the network exactly once. That means each token gets a fixed, finite, and fairly small amount of computation. The model cannot decide to think harder on a hard token. Its per-token budget does not stretch.

Take a word problem where the answer is $3. If you push the model to answer immediately, it has to cram the entire calculation into the one token where it emits the number. That is more work than a single token's budget can hold, so it guesses and often gets it wrong. Everything it writes afterward is just after-the-fact justification, because the answer is already fixed in the sequence.

The fix is to let the model spread the work out. When it writes out intermediate steps ("the total cost of the oranges is $4, so $13 minus $4 is $9," and so on), each token only has to carry one small piece of arithmetic. By the time it reaches the final line, the earlier results are already sitting in front of it and stating the answer is easy. Karpathy's phrase for this is spreading the reasoning and the computation across many tokens.

This is the mechanism underneath advice like "let the model show its work" or "think step by step." It is not a personality trick. A step-by-step answer gives the model more tokens, and therefore more total computation, to reach a result it could never have computed in one jump.

There can never be too much work in any one of these tokens, because then the model won't be able to do that later at test time.
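
The word problem can be laid out the same way in code, with one small operation per line. The problem details here are assumptions, since the text above gives only the intermediate results: 3 apples and 2 oranges, oranges at $2 each, and a total bill of $13, which are the figures that make the steps ($4 for the oranges, $9 left over, $3 per apple) consistent.

```python
# Assumed problem details (not spelled out in the text above).
num_apples = 3
num_oranges = 2
price_per_orange = 2
total_cost = 13

# Each line does one small piece of arithmetic and leaves its result
# available for the next line, like intermediate tokens in the context.
oranges_cost = num_oranges * price_per_orange   # 2 * 2
apples_cost = total_cost - oranges_cost         # total minus oranges
price_per_apple = apples_cost // num_apples     # divides evenly here

print(f"The oranges cost ${oranges_cost} in total.")
print(f"That leaves ${apples_cost} for the apples.")
print(f"Each apple costs ${price_per_apple}.")
```

No single line has to do the whole calculation, and the last line only has to read off a result that is already written down.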

Recollection versus working memory

Karpathy draws one more distinction that ties the whole picture together. Knowledge baked into the parameters is a vague recollection, like something you read a month ago. The tokens sitting in the current context window, the text of your conversation so far, are the model's working memory, like something you experienced a few minutes ago and can still handle precisely.

The practical difference is large. Data in the context window feeds straight into the network and is directly available, so the model can reference it exactly. Data buried in the weights has to be reconstructed and might come out wrong. This is why pasting the relevant text into your prompt, rather than trusting the model to remember it, reliably produces sharper answers.
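
In practice, "pasting the relevant text into your prompt" can be as simple as the sketch below. The `build_prompt` helper, the marker lines, and the instruction wording are all hypothetical choices made for illustration, not a recommended or standard format.

```python
def build_prompt(question, reference_text=None):
    """Hypothetical helper: place reference text in the context window."""
    if reference_text is None:
        # The model must reconstruct the answer from its weights.
        return question
    # The model can read the text directly instead of recalling it.
    return (
        "Answer using the text between the markers.\n"
        "--- BEGIN TEXT ---\n"
        f"{reference_text}\n"
        "--- END TEXT ---\n"
        f"Question: {question}"
    )


question = "What does the second paragraph say about refunds?"
document = "Paragraph 1: Shipping takes 3 days.\nParagraph 2: Refunds take 14 days."

print(build_prompt(question))
print()
print(build_prompt(question, document))
```

The first prompt relies on recollection. The second puts the source text into working memory, where the model can reference it exactly.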

So what sits behind the chat box is narrow and knowable. A fixed function, no memory of its own, a small compute budget per token, guessing the next token again and again. The surprising range of things it can do comes from that one operation run at enormous scale, not from anything hidden inside.

