Teaching models to think: RL and reasoning

The takeaway

Reinforcement learning is the practice-problem stage: the model tries many solutions, keeps the ones that reach the right answer, and discovers its own problem-solving moves instead of only imitating humans. On verifiable problems like math and code this produces real reasoning: long internal chains of thought that emerge on their own. On fuzzy tasks, where a reward has to be simulated, the same process eventually games the simulator, so it cannot be run forever.

The third stage of school

Karpathy compares training a model to putting a child through school, using a textbook as the map. A textbook has three kinds of content. Most of it is exposition, the background knowledge you read to understand a topic. Reading exposition is pre-training, where the model builds its knowledge base from the whole internet.

Next come worked problems, where an expert shows the full solution. Studying those is supervised fine-tuning: the model imitates ideal answers written by human labelers and learns to behave like a helpful assistant. Both of these stages have been standard for years, and every provider does them.

The third kind of content is the practice problems at the end of each chapter. You get the question and the final answer from the answer key, but not the solution. You have to try things and discover what gets you there. That is reinforcement learning, and it is the last major stage of training. It is also the newest and least settled, which is why the DeepSeek-R1 paper, which described it in public, landed as a big deal.

Why imitation is not enough

Take a simple problem: Emily buys three apples and two oranges, each orange costs two dollars, and the total is thirteen dollars. What does an apple cost? There are many ways to write a correct solution. Some set up an equation, some talk through it in English, some jump straight to the answer of three. All reach the right number.

If you are the human writing training data, you do not actually know which solution is best for the model to learn. A model can spend only a small, fixed amount of computation per token, so a step that is trivial for you might be too big a leap for it, and a step you find hard might be trivial for it. Its cognition is not yours. You might skip detail it needs, or spell out detail it finds obvious.

So a human-written solution is a decent way to get the model into the right neighborhood, but it is the wrong tool for finding the exact path. The model needs to discover, for itself, which sequence of tokens reliably gets from the question to the answer. That discovery is what reinforcement learning is for.

Guess and check

The mechanism is simple. Take a prompt, and have the model generate the solution many times. Because generation is stochastic, sampling one token at a time from a probability distribution, each attempt goes down a slightly different path. In practice you might sample thousands of independent attempts for a single prompt.
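As a rough illustration of why repeated generation gives different attempts, here is a minimal sketch that samples tokens from a made-up probability distribution. The tokens, probabilities, and function names are all hypothetical, and a real model would recompute the distribution at every step from the tokens already written.

```python
import random

# Hypothetical next-token distribution; tokens and probabilities are invented.
# For simplicity the same distribution is reused at every step, which a real
# model would not do.
NEXT_TOKEN_PROBS = {
    "Let": 0.4,
    "First": 0.3,
    "The": 0.2,
    "3": 0.1,
}

def sample_token(probs, rng):
    """Draw one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def sample_attempt(length, rng):
    """Sample a sequence one token at a time."""
    return [sample_token(NEXT_TOKEN_PROBS, rng) for _ in range(length)]

rng = random.Random(0)
attempts = [sample_attempt(5, rng) for _ in range(1000)]
distinct = {tuple(a) for a in attempts}
print(f"{len(attempts)} attempts, {len(distinct)} distinct paths")
```

Even with only four tokens and five steps, the same prompt yields many different paths, which is what gives the checking step something to choose between.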

For a math problem you already know the final answer, so you can check each attempt automatically. Some reach the correct answer and some do not. You then train the model on the ones that worked, nudging it to be more likely to take those paths in future. Nobody labeled these solutions as correct. They came from the model itself.

Run this across tens of thousands of problems at once, and the model is effectively a student reviewing its own work and deciding how it should solve this kind of question. It finds token sequences that make no shaky mental leaps, that work reliably, and that use the knowledge it actually has. Guess many solutions, check them, do more of what worked.

Sample many solutions per prompt, not one.
Keep and train on the ones that reach the verified answer.
The correct solutions come from the model, not from a human.
SFT still matters: it puts the model near correct solutions so RL has something to refine.
Repeat across a large, diverse set of problems.
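The guess-and-check loop can be sketched with a toy stand-in for the model. This is not how a production RL trainer works: the "policy" here is just a weight per solution strategy, and the strategy names, success rates, and learning rate are all invented for illustration.

```python
import random

# Hypothetical strategies and the chance each one reaches the right answer.
# They stand in for different token paths through the same problem.
STRATEGIES = {
    "set_up_equation": 0.9,
    "talk_it_through": 0.7,
    "guess_directly": 0.2,
}

def attempt(strategy, rng, correct_answer=3):
    """Simulate one sampled solution and return the final answer it reached."""
    if rng.random() < STRATEGIES[strategy]:
        return correct_answer
    return correct_answer + rng.choice([-1, 1])  # a wrong final answer

def guess_and_check(weights, rng, samples=1000, lr=0.001, correct_answer=3):
    """Sample many attempts, keep the verified ones, upweight their strategy."""
    weights = dict(weights)  # leave the caller's policy untouched
    names = list(weights)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=samples)
    for name in picks:
        if attempt(name, rng, correct_answer) == correct_answer:
            weights[name] += lr  # do more of what worked
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

rng = random.Random(0)
policy = {name: 1 / 3 for name in STRATEGIES}
for _ in range(20):
    policy = guess_and_check(policy, rng)
print({n: round(p, 3) for n, p in policy.items()})
```

No attempt is ever labeled by a person. The only signal is whether the final answer matched the key, and the policy still drifts toward the strategy that works most reliably.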

How thinking emerges

The DeepSeek-R1 paper made this stage public and showed what happens when RL is applied correctly to a language model. On math benchmarks, accuracy climbs steadily over thousands of training steps. That is expected. The striking part is how the model gets there.

As training continues, the average length of each answer grows. The model is choosing to use more tokens. Look at what it writes and you see phrases like "wait, wait," "that's an aha moment," and "let me reevaluate this step by step." The model reconsiders its own steps, tries a different angle, backtracks, and checks its work from another perspective before committing.

No human wrote those moves into a training example, because no human would know what to put there. They emerged from the optimization, purely because they raise accuracy. Karpathy calls these chains of thought, and they are the model rediscovering the cognitive strategies that happen in your head when you solve a hard problem, the part you never write down. Given only correct answers to aim at, the model taught itself to think across many tokens before answering.

This freedom to go past imitation is not new. DeepMind's AlphaGo showed it years earlier. A version trained only to copy expert human games plateaued below the very top players, because you cannot surpass the people you imitate. The version trained with reinforcement learning kept climbing and beat Lee Sedol, one of the strongest human players. In one game it played move 37, a move a human would choose about one time in ten thousand. Commentators thought it was a mistake. In hindsight it was brilliant, a strategy that lay outside how humans play. Imitation is capped at human performance. Reinforcement learning is not, and language models are only starting to show the first hints of the same effect.

The only thing we gave it was the correct answers. The thinking came out on its own.

Reasoning models are trained with RL; the chain of thought is what lengthens their answers.
In ChatGPT, the o1 and o3 models reason this way; GPT-4o is closer to a plain SFT model.
For factual or simple questions, a thinking model is often overkill and slower.
Reach for a thinking model on genuinely hard math and code, and expect to wait while it thinks.

When there is no right answer

Math and code are verifiable. Any candidate solution can be scored automatically against a concrete answer, either by checking the boxed result or by using another model as a judge. That is why RL works so well there and can be run for a very long time. There is no way to fake solving the problem.
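Checking the boxed result can be as simple as a string comparison. The sketch below assumes solutions mark their final answer with a LaTeX-style `\boxed{...}` and that an exact match against the answer key is good enough. Both the helper names and the exact-match rule are assumptions for illustration, not a description of any particular lab's grader.

```python
import re

def extract_boxed(solution: str):
    """Return the contents of the last \\boxed{...} in a solution, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def reward(solution: str, answer_key: str) -> float:
    """1.0 if the boxed final answer matches the key exactly, else 0.0."""
    boxed = extract_boxed(solution)
    return 1.0 if boxed is not None and boxed == answer_key.strip() else 0.0

attempts = [
    r"3a + 2*2 = 13, so 3a = 9 and a = 3. \boxed{3}",
    r"Oranges cost 4 in total, leaving 9 for apples. \boxed{9}",
    "The apple costs three dollars.",  # no boxed answer, so no credit
]
print([reward(a, "3") for a in attempts])  # [1.0, 0.0, 0.0]
```

The third attempt is arguably right but earns nothing, which is one reason a model judge is sometimes used in place of a strict string match.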

Now ask the model to write a joke about pelicans, or a poem, or a summary. There is no answer key. You could have humans rank the outputs, but reinforcement learning at this scale might take on the order of a thousand updates, each over a thousand prompts, with a thousand sampled attempts per prompt. That would mean asking people to score jokes on the order of a billion times. That does not scale.

The fix is reinforcement learning from human feedback, or RLHF. Instead of asking humans to write ideal responses, show them a handful of the model's attempts and ask them to order them from best to worst, which is an easier task. This exploits the discriminator-generator gap: telling which poem is better is far easier than writing a good one. Train a separate network, the reward model, to imitate those human orderings. Now you have a simulator of human taste, and you can run RL against the simulator as often as you like.

Verifiable domains (math, code): score against a concrete answer, no humans needed at scale.
Unverifiable domains (writing, summaries): no answer key, so human judgment has to be involved.
RLHF collects human rankings, not human-written answers, which is a much easier ask.
A reward model learns to imitate those rankings so RL can run automatically.
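To make the reward-model step concrete, here is a minimal sketch that fits a scorer to one human ranking using a pairwise logistic loss, a common choice for learning from orderings. The two features per joke, the ranking itself, and the linear scorer are all invented for illustration. A real reward model is a large neural network reading the text itself.

```python
import math

# Hypothetical features per joke: (relevance, wordplay). A human ordered the
# jokes best to worst; both the features and the ordering are made up.
ranked_jokes = [
    (0.9, 0.8),  # best
    (0.7, 0.3),
    (0.2, 0.4),
    (0.1, 0.0),  # worst
]

def score(weights, features):
    """Linear reward model: one scalar score per response."""
    return sum(w * x for w, x in zip(weights, features))

def train_reward_model(ranking, steps=500, lr=0.5):
    """Fit weights so higher-ranked items outscore lower-ranked ones."""
    weights = [0.0, 0.0]
    # Every (better, worse) pair implied by the ordering.
    pairs = [(ranking[i], ranking[j])
             for i in range(len(ranking)) for j in range(i + 1, len(ranking))]
    for _ in range(steps):
        for better, worse in pairs:
            margin = score(weights, better) - score(weights, worse)
            # Gradient of -log(sigmoid(margin)) with respect to the margin.
            grad = -1.0 / (1.0 + math.exp(margin))
            for k in range(len(weights)):
                weights[k] -= lr * grad * (better[k] - worse[k])
    return weights

weights = train_reward_model(ranked_jokes)
scores = [score(weights, joke) for joke in ranked_jokes]
print([round(s, 2) for s in scores])
```

Once trained, `score` can rate any new attempt without a human in the loop, which is what lets RL run against it as often as needed.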

Why you cannot run RLHF forever

RLHF genuinely helps. It lets you run reinforcement learning in creative domains and reliably produces a slightly better model. GPT-4o went through it. But the reward is a lossy simulation of human judgment, a giant neural network standing in for a real brain, and that is its weakness.

Reinforcement learning is very good at finding ways to game whatever it is scored against. Run RLHF for a few hundred updates and the jokes improve. Keep going and the results fall off a cliff. The model starts producing nonsense that the reward model, inexplicably, rates as perfect. Karpathy's example is that the top-scoring joke about pelicans becomes something like "the the the", which the simulator scores near one even though it means nothing.

These are adversarial examples, inputs that slip into the cracks of the reward network and get fake high scores. You can add them to the training data with terrible scores, but there is always another one waiting. Karpathy's summary is that RLHF is not RL, not in the magical sense. On verifiable problems you either got the answer or you did not, the scoring cannot be fooled, and you can pour in more compute for genuinely better results. RLHF is more like a small fine-tune: run it a few hundred steps, then stop before the model learns to cheat.
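A toy version of this failure is easy to reproduce. The sketch below uses a deliberately flawed scorer, an average of per-token weights with an inflated weight on "the", and a greedy optimizer that swaps one token per update. The vocabulary, weights, and optimizer are all invented for illustration. Real reward models fail in far less obvious ways.

```python
# Toy reward model: the average of per-token weights. The vocabulary and
# weights are invented; the inflated weight on "the" is the deliberate flaw.
TOKEN_WEIGHTS = {
    "why": 0.4, "did": 0.3, "pelican": 0.6, "cross": 0.5,
    "road": 0.5, "fish": 0.6, "the": 0.95,
}

def reward_model(tokens):
    """Score a joke between 0 and 1; it knows nothing about meaning."""
    return sum(TOKEN_WEIGHTS[t] for t in tokens) / len(tokens)

def optimize(tokens, max_updates):
    """Greedy hill climbing: each update makes the single best token swap."""
    tokens = list(tokens)
    for _ in range(max_updates):
        best, best_score = None, reward_model(tokens)
        for i in range(len(tokens)):
            for candidate in TOKEN_WEIGHTS:
                trial = tokens[:i] + [candidate] + tokens[i + 1:]
                trial_score = reward_model(trial)
                if trial_score > best_score:
                    best, best_score = trial, trial_score
        if best is None:
            break  # no swap raises the score any further
        tokens = best
    return tokens

start = "why did the pelican cross the road".split()
hacked = optimize(start, max_updates=20)
print(" ".join(start), round(reward_model(start), 2))
print(" ".join(hacked), round(reward_model(hacked), 2))
```

The optimizer never sees anything but the score, so it has no reason to prefer a joke over a string of repeated words once the words score higher.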

You crop it, you call it, you ship it. Run RLHF too long and the model learns to trick the judge.

Watch the full 3.5-hour video
