Distilled reasoning models trust the reasoning you give them more than the reasoning they write

Reasoning models write out their thinking before they answer. We tested, on two distilled reasoning models, whether the answer actually depends on that thinking. When we hand the model a chain of steps to follow, its answer tracks those steps almost perfectly. When the model writes its own steps, only about half of them change the answer, and the ones that matter sit near the end.

Quick summary: A reasoning model is a language model that writes a chain of intermediate steps (a "chain-of-thought") before giving a final answer. People read that chain as an explanation and use it to monitor what the model is doing, which only works if the answer really depends on the steps. We measured that dependence on two open models, DeepSeek-R1-Distill-Qwen 1.5B and 7B, by corrupting one step at a time and watching how much the model's confidence in its answer moves. The two settings gave different results. When we provide the reasoning chain, the models follow it tightly: corrupting a step that matters collapses the right answer, corrupting an irrelevant step does nothing, and the two are cleanly separated on over 99% of problems. When the models reason for themselves on grade-school math, the dependence is real but thin: only about 43% (1.5B) and 53% (7B) of the steps they write move the answer, and that share rises through the chain, so the early steps carry the least. The bigger model is more faithful in both settings. One honest limit: the generated-reasoning result rests on 25 and 32 solved problems, so we report it as suggestive, not settled.
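To make the "corrupt one step at a time" idea concrete, here is a minimal sketch of that kind of measurement. It is an illustration under stated assumptions, not the code behind the numbers above: `toy_confidence` is a hypothetical stand-in for asking a real model how confident it is in the correct answer, `corrupt` is a placeholder corruption, and the `0.1` threshold for "moves the answer" is an arbitrary choice made for the example.

```python
from typing import Callable, List, Sequence


def step_effects(
    steps: Sequence[str],
    confidence: Callable[[Sequence[str]], float],
    corrupt: Callable[[str], str],
) -> List[float]:
    """Return, for each step, how far confidence falls when only that step is corrupted."""
    baseline = confidence(steps)
    effects = []
    for i in range(len(steps)):
        corrupted = list(steps)
        corrupted[i] = corrupt(steps[i])  # change exactly one step, keep the rest
        effects.append(baseline - confidence(corrupted))
    return effects


def share_that_matter(effects: Sequence[float], threshold: float = 0.1) -> float:
    """Fraction of steps whose corruption shifts confidence by more than the threshold."""
    if not effects:
        return 0.0
    return sum(abs(e) > threshold for e in effects) / len(effects)


# Hypothetical stand-ins (assumptions for this sketch, not a real model).
def toy_confidence(steps: Sequence[str]) -> float:
    # Pretend the model is confident only if the key arithmetic step is intact.
    return 1.0 if "5 + 3 = 8" in steps else 0.25


def corrupt(step: str) -> str:
    # Placeholder corruption: overwrite the step with a marker string.
    return "[corrupted]"


steps = [
    "Let me restate the problem.",
    "Tom has 5 apples and gets 3 more.",
    "5 + 3 = 8",
]
effects = step_effects(steps, toy_confidence, corrupt)
share = share_that_matter(effects)
print(effects, share)
```

In this toy chain only the last step carries the answer, so corrupting either of the first two leaves confidence unchanged. With a real model, `toy_confidence` would be replaced by the probability the model assigns to the correct answer given the (possibly corrupted) chain, and the per-step effects could then be grouped by position to see whether the share that matters rises toward the end of the chain.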