Why LLMs make things up, and what fixes it

The takeaway

The model has no built-in sense of "I don't know," so it fills gaps with confident guesses. Its baked-in knowledge is a vague recollection; the words in front of it are precise working memory. Put the facts in the context, tell it to use tools for anything factual or numerical, and verify the result.

The confident guess

Ask a model who Orson Kovats is. There is no such person; Andrej Karpathy made the name up. A well-trained assistant now says it doesn't recognize the name. Older models did something else: they produced a fluent, confident little biography, entirely invented.

The reason sits in how the model was trained. Its training data is full of conversations shaped like "who is X," and in every one, the answer is a confident, correct paragraph. The model learned the shape of that answer. When you feed it a name it has never seen, it still reaches for the same shape and fills in the blanks with whatever is statistically plausible.

At each step the model is choosing the next token from a probability distribution. It does not check a database. It does not pause to notice the gap in its knowledge. It samples a likely-looking word, then the next, and the result reads to you like a fact. Karpathy calls these models statistical token tumblers, and that is exactly the failure mode: a smooth continuation with no anchor to truth.
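To make the sampling step concrete, here is a minimal sketch of choosing one token from a probability distribution. The candidate words and their scores are made up for illustration; a real model scores tens of thousands of tokens, and nothing here is taken from an actual model.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

def sample_next_token(logits, rng):
    """Draw one token at random, weighted by its probability."""
    probs = softmax(logits)
    tokens = list(probs)
    weights = [probs[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Made-up scores for the word after "Orson Kovats is a ..."
toy_logits = {"writer": 2.0, "physicist": 1.5, "composer": 1.2, "unknown": -1.0}

rng = random.Random(0)
print(sample_next_token(toy_logits, rng))
```

Nothing in this loop looks anything up. A plausible profession scores higher than "unknown," so a plausible profession is what usually comes out.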

No sense of not knowing

The strange part is that the model may internally represent its own uncertainty. Somewhere in the network there is probably a signal that lights up when it is on unfamiliar ground. The problem is that this signal was never wired to the words "I don't know." So the model stays quiet about its doubt and takes its best guess out loud instead.

This is why the fix happens in training, not in a clever prompt. Meta described the approach for the Llama 3 models: interrogate the model about facts, ask the same question several times, and see whether the answers are stable and correct. Where the model consistently gets it wrong, add training examples where the correct answer is "I don't know" or "I don't remember."

A few thousand of those examples let the model connect its internal uncertainty to an actual refusal. It learns that when the doubt signal is high, the right move is to say so. That single association is most of why newer assistants decline to invent an Orson Kovats where older ones happily obliged.
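The probing loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not Meta's pipeline: `ask_model` is a hypothetical callable that returns one answer per call, the exact-string comparison is a simplification, and the sketch also flags unstable answers, not only consistently wrong ones. The toy model and the town of Examplestad are made up.

```python
from collections import Counter
from itertools import cycle

def knows_answer(ask_model, question, reference, n_samples=5, min_agreement=0.8):
    """Ask the same question several times; pass only if the answers agree and are right."""
    answers = [ask_model(question).strip().lower() for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    stable = count / n_samples >= min_agreement
    correct = top_answer == reference.strip().lower()
    return stable and correct

def build_refusal_examples(ask_model, qa_pairs, refusal="I'm sorry, I don't know."):
    """Create a refusal training example for every question the model fails."""
    return [
        {"prompt": question, "response": refusal}
        for question, reference in qa_pairs
        if not knows_answer(ask_model, question, reference)
    ]

# Toy stand-in for a model: solid on one famous fact, guessing on the other.
_guesses = cycle(["the Exempel", "the Demo", "the Sample"])

def toy_model(question):
    if question == "Who wrote Pride and Prejudice?":
        return "Jane Austen"
    return next(_guesses)

qa_pairs = [
    ("Who wrote Pride and Prejudice?", "Jane Austen"),
    ("Which river runs through Examplestad?", "the Exempel"),  # made-up fact
]

refusals = build_refusal_examples(toy_model, qa_pairs)
print(refusals)
```

Only the question the toy model keeps changing its mind about ends up in the refusal set. A real pipeline would need a more forgiving way to compare answers than string equality.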

Vague memory versus working memory

Here is the mental model worth keeping. The knowledge packed into the model's billions of parameters is a vague recollection. Karpathy compares it to something you read a month ago. If you read it many times, you remember it well; if you saw it once, your recall is fuzzy and unreliable. The model is the same. Facts that appear all over the internet come back sharp. Rare facts come back blurry, or wrong.

The context window is different. That is the text sitting right in front of the model, the words in your prompt and the conversation so far. Karpathy calls it working memory, and it is precise. The model doesn't have to recall anything from the context; it can read it directly.

This distinction changes how you should prompt. Ask a model to summarize chapter one of Pride and Prejudice and it does a passable job from its fuzzy memory of a famous book. Paste the actual chapter into the prompt and the summary gets sharply better, because now it is working from the text instead of a recollection of it. You would write a better summary after rereading the chapter too. Same mechanism.

Knowledge in the parameters is a vague recollection. Knowledge in the context window is working memory.

Parameters: broad, baked-in, fuzzy for anything rare.
Context window: precise, but limited to what you put there.
If a fact matters, put it in the context rather than trusting recall.
Common facts recall well; obscure ones are where hallucination hides.
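One way to act on this is to build the prompt so the source text travels with the question. The sketch below is a plain string template; the tag name and the instruction wording are assumptions, not a format any particular model requires.

```python
def build_grounded_prompt(question, source_text, label="SOURCE"):
    """Wrap the source text and the question into one prompt string."""
    return (
        f"Answer using only the text between the {label} tags.\n"
        "If the answer is not in the text, say you don't know.\n\n"
        f"<{label}>\n{source_text.strip()}\n</{label}>\n\n"
        f"Question: {question}"
    )

# Opening sentence of Pride and Prejudice, standing in for the full chapter.
chapter_one = (
    "It is a truth universally acknowledged, that a single man in possession "
    "of a good fortune, must be in want of a wife."
)

prompt = build_grounded_prompt("Summarize chapter one.", chapter_one)
print(prompt)
```

The model now reads the chapter instead of recalling it. In practice you would paste the whole chapter, not just its first sentence.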

Tools as the fix

When you don't know a fact, you look it up. The model can do the same, if you let it. Modern assistants can emit a special token that triggers a web search. The search runs, the resulting text gets pasted into the context window, and now the model answers from that fresh text instead of from memory. The lookup refreshes its working memory, exactly like you refreshing yours.
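The search round trip can be sketched as a small loop around the model. Everything here is a stand-in: `generate` and `search` are hypothetical callables, and the `<SEARCH_START>` and `<SEARCH_END>` markers are illustrative names for the special tokens, not a documented format of any specific assistant.

```python
import re

# Illustrative markers for a search request inside the model's output.
SEARCH_PATTERN = re.compile(r"<SEARCH_START>(.*?)<SEARCH_END>", re.DOTALL)

def answer_with_search(generate, search, prompt, max_rounds=3):
    """Run the model; whenever it asks for a search, paste the results into the context."""
    context = prompt
    for _ in range(max_rounds):
        output = generate(context)
        match = SEARCH_PATTERN.search(output)
        if match is None:
            return output  # no search requested, so this is the answer
        query = match.group(1).strip()
        results = search(query)
        context += f"\n[search results for {query!r}]\n{results}\n"
    # Out of rounds: return whatever the model says next.
    return generate(context)

# Toy stand-ins so the loop can run without a real model or search engine.
queries = []

def toy_search(query):
    queries.append(query)
    return "No results found for this name."

def toy_generate(context):
    if "[search results" in context:
        return "I could not find anyone by that name."
    return "<SEARCH_START>who is Orson Kovats<SEARCH_END>"

final = answer_with_search(toy_generate, toy_search, "Who is Orson Kovats?")
print(final)
```

The second call to the model sees the search results in its context, so its answer comes from reading, not from memory.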

The same trick covers arithmetic and counting. Ask a model to count the dots in a big block and it guesses, because it tries to do the whole count in a single forward pass with almost no room to work. Tell it to use code and it writes a short Python snippet, runs it, and reads back the exact answer. The Python interpreter does the counting; the model just orchestrates.
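The snippet the model writes for a task like this tends to be tiny. A minimal sketch, with a made-up block of dots standing in for the one in the prompt:

```python
def count_char(text, target="."):
    """Count exact occurrences of one character."""
    return text.count(target)

# Stand-in for the block of dots: 7 rows of 25 dots each.
block = "\n".join("." * 25 for _ in range(7))

print(count_char(block))  # 175
```

Copying the dots into a string is a task the model handles well. The counting itself is handed to the interpreter, which does not guess.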

The lesson generalizes. For anything factual, recent, or numerical, the model does better leaning on a tool than trusting its own head. When you want it to rely on memory instead, you can say "don't use any tools," and it will.

Web search pulls facts into the context so the model reads instead of recalls.
The code interpreter computes exact answers instead of guessing them.
Default to tools for facts, dates, math, and counting.
You can turn tools off explicitly when you want to test raw recall.

Why spelling and counting break

For a long time, every top model insisted there are two R's in strawberry. There are three. This looks absurd next to a model that solves math-olympiad problems, and the explanation is tokenization. The model never sees letters. It sees tokens, which are chunks of text, and "strawberry" arrives as a couple of chunks, not a string of characters. Counting the R's means seeing the R's, and the model can't.

The same wall shows up when you ask it to print every third character of "ubiquitous." That word is three tokens to the model. You can index into individual letters because your eyes see them; the model has no direct access to the characters hidden inside its tokens. Character-level tasks fail for this reason, not because the model is dim.
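Both tasks are trivial once they run as code, because code sees characters and not tokens. A short sketch of what the model could write if told to use the interpreter:

```python
def count_letter(word, letter):
    """Count a letter, ignoring case."""
    return word.lower().count(letter.lower())

def every_third_char(word):
    """Take every third character, starting with the first."""
    return word[::3]

print(count_letter("strawberry", "r"))  # 3
print(every_third_char("ubiquitous"))   # uqts
```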

Then there are the failures that are just weird. Ask which is bigger, 9.11 or 9.9, and the model often says 9.11. One investigation found that the neurons lighting up were the ones associated with Bible verses, where 9.11 does come after 9.9. The model gets cognitively distracted and lands on the wrong answer, even while trying to justify it with arithmetic. Some sharp edges make sense once you know about tokens. Others leave you scratching your head.

The model can win a math olympiad and still tell you 9.11 is bigger than 9.9.
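The two readings of "9.11 versus 9.9" are easy to tell apart in code. The sketch below compares the same two strings once as decimal numbers and once as chapter-and-verse style labels; the function names are made up for this example.

```python
from decimal import Decimal

def larger_as_number(a, b):
    """Compare two strings as decimal numbers."""
    return a if Decimal(a) > Decimal(b) else b

def later_as_section(a, b):
    """Compare two strings as dotted labels, like chapter.verse."""
    def key(label):
        return tuple(int(part) for part in label.split("."))
    return a if key(a) > key(b) else b

print(larger_as_number("9.11", "9.9"))  # 9.9
print(later_as_section("9.11", "9.9"))  # 9.11
```

As numbers, 9.9 wins. As verse labels, 9.11 comes later. The model's mistake looks like the second reading leaking into the first, and routing the comparison through code removes the ambiguity.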

The swiss cheese

Karpathy's image for all of this is swiss cheese. The models are excellent across an enormous range of tasks, then fail randomly in some specific spot. The competence is real and the holes are real, and the holes don't line up with anything intuitive. A model that reasons through a physics problem can trip on counting letters in a common word.

So don't treat these systems as infallible. Check their work. Use them for a first draft, for inspiration, for the heavy lifting, and stay responsible for the final product yourself. That is not a knock on the technology. It is the correct way to hold a tool that is brilliant and unreliable in the same breath.

The practical version is short. Put the facts the model needs into the context instead of trusting its memory. Ask it to use search for anything factual and code for anything numerical. Then read the answer with the swiss cheese in mind, because the next hole is somewhere you won't expect.

Feed facts into the context; don't rely on baked-in recall.
Route factual and numerical work through tools.
Verify anything that matters before you ship it.
Expect competence and random failure to coexist.

Watch the full 3.5-hour video
