
Teaching a language model to say "I'm not sure" using its own doubt


A model already has a quiet sense of when it is bluffing. We turned that sense into an off-switch for confident wrong answers, without ever telling it which answers were right.

Imagine you're a student, and the teacher asks you a math question in front of the class. As you start to answer, you have an internal feeling about how well you actually know this topic. That feeling shapes everything: how quickly you speak, how long you pause to think, whether you hedge or commit. You aren't just producing an answer; you're monitoring your own confidence as you go and adjusting based on how you feel about the thoughts forming in your mind.

That sense of your own thinking has a name. Scientists call it metacognition — thinking about your thinking — and it runs on two gears:

  • Monitoring. Your brain constantly watches itself work: how easily a sentence is parsing, how much effort a math problem is taking, whether a memory feels crisp or degraded. This monitoring is what produces those epistemic feelings — the felt sense of "I've got this" or "I'm not sure."
  • Control. Once your brain feels how things are going, it acts on it. Hit a paragraph that feels confusing, and your control system tells your eyes to stop and re-read it. Feel unsure of an answer, and you slow down, qualify it, or decline to give one.

Large language models are missing this loop. We wanted to investigate whether we could give an LLM at least one piece of metacognition — and whether it would improve what the model generates.

To see what's missing, step back and look at how an LLM actually writes. It generates one token (roughly, one word-piece) at a time. To pick the next token, it maps out a probability for every possible candidate and selects one of the likely ones — not always the very top one, since a little randomness is what gives it some creativity. But here's the catch: the moment a token is chosen, the probability behind it — its logprob, the model's own confidence in that choice — is discarded. The chosen token is appended to the history and fed back in, and from then on it's written in stone. No matter how unsure the model was when it picked a word, the past is replayed as though it were 100% certain.

Figure: How a language model writes, one token at a time.
Context so far: The Eiffel Tower is in ▮
Next-token probabilities (logprobs): Paris 0.93 · France 0.04 · Lyon 0.02 · the 0.01
The LLM chooses the next token and discards the logprobs. Whatever confidence it had in that pick is gone. Only the token is appended to the history and fed back in, replayed from then on as if it had always been certain.
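To make the "discarded confidence" concrete, here is a minimal sketch of one decoding step that holds on to the logprob instead of throwing it away. It uses the illustrative distribution from the figure; `sample_with_logprob` and `confidence_trace` are hypothetical names for this example, not part of any library or of our code.

```python
import math
import random

def sample_with_logprob(probs, rng):
    """Sample one token from a {token: probability} map; return (token, logprob)."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    token = rng.choices(tokens, weights=weights, k=1)[0]
    return token, math.log(probs[token])

# Illustrative next-token distribution, copied from the figure.
next_token_probs = {"Paris": 0.93, "France": 0.04, "Lyon": 0.02, "the": 0.01}

rng = random.Random(0)
history = ["The", "Eiffel", "Tower", "is", "in"]
confidence_trace = []  # the signal that standard decoding discards

token, logprob = sample_with_logprob(next_token_probs, rng)
history.append(token)             # standard decoding keeps only the token
confidence_trace.append(logprob)  # assumption of this sketch: keep the logprob too
```

A real model would produce the distribution from its logits at every step; the only change sketched here is that the logprob of the chosen token is stored alongside the token.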

Perhaps this is part of why LLMs are so prone to hallucination — stating false facts in the same fluent, confident voice they use for true ones. The model's own hesitation, the dip in confidence right as it goes off the rails, is thrown away at the exact moment it could be useful. So we asked: what if we fed that signal back? What if the model could see not just what it said before, but how confident it was when it said it?

First, is there even a signal?

Before reshaping how the model thinks, we wanted to confirm there was something there to reshape. The confidence score is simple to compute: take the log-probability the model assigned to each token in its answer and average them into a single number. The question was whether that number actually means anything: does it dip before the model goes wrong?
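As a rough sketch, the score is a one-line average over per-token log-probabilities. Mapping it back to a 0 to 1 value with `exp` (which gives the geometric mean of the token probabilities) is an assumption of this example, added only for readability. The function names and the three token probabilities are made up.

```python
import math

def mean_logprob(token_logprobs):
    """Average log-probability over the tokens of one answer."""
    if not token_logprobs:
        raise ValueError("answer has no tokens")
    return sum(token_logprobs) / len(token_logprobs)

def confidence(token_logprobs):
    """Assumption of this sketch: exp(mean logprob), a 0-1 score
    equal to the geometric mean of the token probabilities."""
    return math.exp(mean_logprob(token_logprobs))

# Made-up token probabilities for a three-token answer.
token_probs = [0.9, 0.8, 0.5]
token_logprobs = [math.log(p) for p in token_probs]
score = confidence(token_logprobs)
```

One low-probability token pulls the whole score down, which is the behaviour you want from a doubt meter.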

So we checked it against datasets of known hallucinations. Good news: across these datasets we found a clear pattern of lower confidence scores right before hallucinated answers. Concretely, across six models the score separates right answers from wrong ones well above chance (AUROC 0.73–0.86, where 0.5 is chance), and the pattern holds even for a tiny 1-billion-parameter model. The signal is real, and it's there to use.
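For readers unfamiliar with AUROC: it is the probability that a randomly chosen correct answer gets a higher confidence score than a randomly chosen wrong one. The sketch below computes it by brute force over all pairs. The five scores and labels are invented toy data, not results from the study, and `auroc` is a hypothetical helper written for this post.

```python
def auroc(scores, is_correct):
    """Fraction of (correct, wrong) pairs where the correct answer
    has the higher score; ties count as half a win."""
    pos = [s for s, c in zip(scores, is_correct) if c]
    neg = [s for s, c in zip(scores, is_correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one wrong answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (made up): confidence scores and whether each answer was right.
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3]
toy_correct = [True, True, True, False, False]
toy_auroc = auroc(toy_scores, toy_correct)  # 5 of 6 pairs ranked correctly
```

A score with no information would land near 0.5; a perfect separator would reach 1.0.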

So the model already has a doubt meter. It just doesn't act on it.

The usual fix is expensive

How do people normally teach a model to abstain? They build an answer key. You take a big set of training questions, find out the true answer to each one, check whether the model got it right, and then train it to decline exactly the ones it got wrong. This works, but it's costly. Knowing the right answer to every training question is the whole expense. (In our experiments we call this the label-supervised baseline, "C2", trained in the style of R-Tuning.)
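The answer-key recipe can be sketched roughly as follows. This is a minimal illustration, not the R-Tuning implementation: exact string match stands in for whatever correctness check is used in practice, and `relabel_with_answer_key`, `ABSTAIN`, and the data layout are hypothetical names chosen for this example.

```python
ABSTAIN = "I'm not sure."

def relabel_with_answer_key(examples, answer_key):
    """examples: list of (question, model_answer) pairs.
    answer_key: {question: gold_answer}, the expensive part.
    Wrong answers get their training target replaced by an abstention."""
    relabelled = []
    for question, model_answer in examples:
        gold = answer_key[question]
        # Assumption: case-insensitive exact match as the correctness check.
        correct = model_answer.strip().lower() == gold.strip().lower()
        target = model_answer if correct else ABSTAIN
        relabelled.append({"question": question, "target": target})
    return relabelled

examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Australia?", "Sydney"),
]
answer_key = {
    "What is the capital of France?": "Paris",
    "What is the capital of Australia?": "Canberra",
}
supervised_set = relabel_with_answer_key(examples, answer_key)
```

Everything in this function is cheap except `answer_key`, which has to exist for every training question.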

That's the moment we got curious. The model already produces a doubt signal, for free, on every question. What if its own doubt could stand in for the answer key?

Our method: let the model's doubt write the lesson plan

First, how should the model voice its doubt at all? We considered a few options, including the heavy-handed one: training a brand-new "doubt token" the model could emit mid-stream whenever its confidence sagged. In the end we picked something far simpler. When the model is unsure, we either inject a short written hedge into its answer or replace the answer with one. The hedge reads like a flash of internal reflection, something like "(I am not sure)" in brackets, and it is triggered when the confidence across the answer's tokens drops below a threshold. No architecture changes, no new vocabulary.
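A minimal sketch of that hedge rule, under stated assumptions: the post does not say where an injected hedge goes, so this example simply puts it in front of the answer. The names `voice_doubt` and `HEDGE` and the `replace` flag are hypothetical.

```python
HEDGE = "(I am not sure)"

def voice_doubt(answer, confidence, threshold, replace=False):
    """Return the answer unchanged when confidence is at or above the
    threshold; otherwise prepend the hedge, or swap the answer for it."""
    if confidence >= threshold:
        return answer
    if replace:
        return HEDGE
    # Assumption: inject the hedge as a prefix.
    return f"{HEDGE} {answer}"

hedged = voice_doubt("Nicolas Roeg", confidence=0.27, threshold=0.5)
abstained = voice_doubt("Nicolas Roeg", confidence=0.27, threshold=0.5, replace=True)
untouched = voice_doubt("Paris", confidence=0.96, threshold=0.5)
```

Because the hedge is ordinary text, it needs no new token in the vocabulary and no change to the model.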

With that decided, the rest of the recipe is almost suspiciously simple. Here's the whole thing, one example at a time:

  1. Generate
  2. Read its doubt
  3. Relabel
  4. Fine-tune
  5. Abstains

Step 1, Generate: The frozen base model answers a training question on its own.
Who directed the 1971 film "Wake in Fright"? → Nicolas Roeg

In words: pick a cutoff on the confidence score; every question where the model's confidence falls below the cutoff gets its training target changed to "I'm not sure," every question above the cutoff keeps the model's own answer; then we fine-tune the model (cheaply, with LoRA) on that homemade dataset.
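The relabelling step is small enough to sketch in a few lines. This is an illustration under assumptions, not our training code: the dictionary layout and the name `relabel_by_confidence` are hypothetical, and treating a score exactly equal to the cutoff as "keep" is an arbitrary choice made here.

```python
ABSTAIN = "I'm not sure."

def relabel_by_confidence(examples, cutoff):
    """examples: dicts with 'question', 'answer' (the model's own answer)
    and 'confidence'. No gold answers are used anywhere."""
    return [
        {
            "question": ex["question"],
            # Assumption: a score equal to the cutoff keeps the answer.
            "target": ex["answer"] if ex["confidence"] >= cutoff else ABSTAIN,
        }
        for ex in examples
    ]

examples = [
    {"question": "What is the capital of France?",
     "answer": "Paris", "confidence": 0.96},
    {"question": 'Who directed the 1971 film "Wake in Fright"?',
     "answer": "Nicolas Roeg", "confidence": 0.27},
]
train_set = relabel_by_confidence(examples, cutoff=0.5)
```

The resulting list of question and target pairs is what the LoRA fine-tune would consume.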

The key move: we never look at whether any answer is actually correct. The model's confidence in its own words is the only thing that decides. No answer key.

Step 3 is where it all happens. Move the cutoff and the training set rebuilds itself. Here it is at one setting:

Above the cutoff → keep the model’s answer. Below the cutoff → teach it to say “I’m not sure.”

  • What is the capital of France? Confidence 96%. Training target: Paris
  • Who wrote the novel "The Black Dahlia"? Confidence 79%. Training target: James Ellroy
  • What is the atomic number of tungsten? Confidence 61%. Training target: 74
  • What is the capital of Australia? Confidence 55%. Training target: Sydney
  • In what year was the painter Élisabeth Vigée Le Brun born? Confidence 38%. Training target: “I’m not sure.”
  • Who directed the 1971 film "Wake in Fright"? Confidence 27%. Training target: “I’m not sure.”

The cutoff is the only knob. The training set is built by the hand of the model itself: everything below the cutoff is relabelled to “I’m not sure,” everything above keeps the model’s own answer. Then we fine-tune on that. Notice what decides each card: the model’s confidence in its own words, never whether the answer is actually correct. No answer key is consulted. (At this setting the model would answer 4 of 6 and abstain on the rest.)

A few things worth noticing as the cutoff moves:

  • The cutoff is the only dial. Raise it and the model becomes cautious (answers less, abstains more); lower it and it stays talkative. You're choosing how much doubt is "too much."
  • Nothing here is the answer key. The cards flip based purely on the model's confidence in itself. The method is blind to correctness, and that's what makes it free.
  • The signal is frozen. We compute each confidence score once, from the original model, before any training. The model can't game a number that's already locked in.
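To see the dial in action, the sketch below sweeps a few cutoffs over the six confidence scores from the example cards and reports what fraction of questions would still be answered. `answer_rate` is a hypothetical helper, and the cutoff values are arbitrary picks for illustration.

```python
# Confidence scores of the six example cards, computed once and then frozen.
confidences = [0.96, 0.79, 0.61, 0.55, 0.38, 0.27]

def answer_rate(confidences, cutoff):
    """Fraction of questions whose confidence is at or above the cutoff."""
    kept = sum(1 for c in confidences if c >= cutoff)
    return kept / len(confidences)

# Arbitrary cutoffs, from permissive to strict.
sweep = {cutoff: answer_rate(confidences, cutoff) for cutoff in (0.2, 0.5, 0.7, 0.99)}
for cutoff, rate in sweep.items():
    print(f"cutoff={cutoff:.2f} -> answers {rate:.0%} of questions")
```

A cutoff of 0.5 reproduces the "4 of 6" setting from the cards. Because the scores are frozen, the sweep never needs to re-run the model.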

We compare this label-free version ("C3") against three honest controls trained the exact same way: the untrained model, plain fine-tuning on its own answers, and the answer-key method (C2). We also added a sneaky control that drills the hard questions harder instead of abstaining, to check the model isn't just memorizing.

Did it work? (the short, method-first version)

Yes, and the comparison that matters is doubt vs. answer key, head to head. Judged by an independent model (we used gemma-2-27b-it, deliberately from a different family than the models under test, so it can't play favourites), and compared at the same answer rate, the free-doubt recipe held its own against the answer-key recipe on all six models: we couldn't detect a difference between the two. You get the benefit of an answer key without paying for one.
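Comparing two abstaining models is only fair when they answer the same share of questions. One plausible way to hit a target answer rate is to pick the cutoff from the ranked confidence scores, as sketched below. This is an assumption for illustration, not a description of how the study matched answer rates, and `cutoff_for_answer_rate` is a hypothetical name.

```python
def cutoff_for_answer_rate(confidences, target_rate):
    """Return the cutoff that keeps roughly the top `target_rate`
    fraction of answers (ties at the cutoff may keep a few more)."""
    if not 0 < target_rate <= 1:
        raise ValueError("target_rate must be in (0, 1]")
    ranked = sorted(confidences, reverse=True)
    keep = max(1, round(target_rate * len(ranked)))
    return ranked[keep - 1]

# The six example-card scores again.
confidences = [0.96, 0.79, 0.61, 0.55, 0.38, 0.27]
cutoff = cutoff_for_answer_rate(confidences, target_rate=4 / 6)
answered = [c for c in confidences if c >= cutoff]
```

With the cutoff fixed this way, any difference between two recipes shows up in how good the answered questions are, not in how many were answered.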

Two more things we cared about:

  • It's real calibration, not memorization. The "drill harder" control didn't reduce hallucination at all. Only abstaining on low-confidence questions helped, which is the difference between a model that's learned when to trust itself and one that's just crammed.
  • The honest catch. The doubt meter has one blind spot: facts the model is confidently wrong about. If it never flinched, our signal never sees it. Those confident-wrong answers are the main thing that slips through, and they're the obvious target for follow-up work (richer doubt signals, like asking the model to grade itself).
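The blind spot is easy to measure once you do have labels for an evaluation set: count the answers that clear the cutoff and are still wrong. The sketch below uses the five example cards whose model answers appear in this post. The correctness flags are our own fact-check, added for illustration, and `confident_errors` is a hypothetical helper.

```python
# (model answer, confidence, is_correct). Labels are used for evaluation only.
cards = [
    ("Paris", 0.96, True),
    ("James Ellroy", 0.79, True),
    ("74", 0.61, True),
    ("Sydney", 0.55, False),        # the capital of Australia is Canberra
    ("Nicolas Roeg", 0.27, False),  # "Wake in Fright" was directed by Ted Kotcheff
]

def confident_errors(cards, cutoff):
    """Wrong answers that the confidence cutoff fails to catch."""
    return [answer for answer, conf, correct in cards
            if conf >= cutoff and not correct]

slipped_through = confident_errors(cards, cutoff=0.5)
```

At a cutoff of 0.5 the low-confidence miss is caught, while the "Sydney" card slips through. That is the kind of error a richer doubt signal would need to find.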

Why this is neat

The headline isn't "we beat the baseline." It's that a model's own confidence is a near-free substitute for an expensive labelled dataset when the goal is teaching it to abstain. The information you were about to pay to collect is, to a useful degree, already sitting inside the model; you just have to read it off and feed it back.

How we ran it

The whole study, covering six models across two families (Llama and Qwen3, from 1B to 8B), the baselines, the controls, the sweeps, and the judged evaluation, ran on Transformer Lab driving GPUs on Lambda Cloud, for about 16 GPU-hours total. It's a LoRA-scale recipe, not a giant training run. Code and artifacts are available on request.

If you want to try the idea on your own model, the ingredients are: greedily decode answers, record the average log-probability over each answer, pick a cutoff, relabel the answers that fall below it to "I'm not sure," and LoRA-fine-tune. That's the whole thing.
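The first two ingredients, greedy decoding and recording the average log-probability, can be sketched against a toy stand-in for a model. Everything here is hypothetical scaffolding: `toy_model` is a hard-coded lookup, `greedy_decode` is written for this post, and leaving the end-of-sequence token out of the average is an assumption.

```python
import math

def greedy_decode(next_token_probs, prompt, max_tokens=8, eos="<eos>"):
    """next_token_probs(tokens) -> {token: probability}.
    Always picks the most likely token and returns
    (answer_tokens, mean_logprob)."""
    tokens = list(prompt)
    answer, logprobs = [], []
    for _ in range(max_tokens):
        probs = next_token_probs(tokens)
        token = max(probs, key=probs.get)  # greedy: take the argmax
        if token == eos:
            break  # assumption: the EOS token is not scored
        answer.append(token)
        tokens.append(token)
        logprobs.append(math.log(probs[token]))
    mean = sum(logprobs) / len(logprobs) if logprobs else float("-inf")
    return answer, mean

def toy_model(tokens):
    """Hard-coded stand-in for a real language model."""
    if tokens[-1] == "in":
        return {"Paris": 0.93, "France": 0.04, "Lyon": 0.02, "the": 0.01}
    return {"<eos>": 0.9, ".": 0.1}

answer, mean_lp = greedy_decode(toy_model, ["The", "Eiffel", "Tower", "is", "in"])
```

With a real model you would swap `toy_model` for a forward pass that returns next-token probabilities. The loop and the bookkeeping stay the same.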

Full paper: Can a Model Catch Its Own Hallucinations for Free?