Teaching a language model to say "I'm not sure" using its own doubt
· 8 min read
A model already has a quiet sense of when it is bluffing. We turned that sense into an off-switch for confident wrong answers, without ever telling it which answers were right.
Imagine you're a student, and the teacher asks you a math question in front of the class. As you start to answer, you have an internal feeling about how well you actually know this topic. That feeling shapes everything: how quickly you speak, how long you pause to think, whether you hedge or commit. You aren't just producing an answer; you're monitoring your own confidence as you go and adjusting based on how you feel about the thoughts forming in your mind.

