Distilled reasoning models trust the reasoning you give them more than the reasoning they write

June 25, 2026 · 8 min read

Ali Asaria

Tony Salomone

Deep Gandhi

Reasoning models write out their thinking before they answer. We tested whether the answer actually depends on that thinking, on two distilled reasoning models. When we hand the model a chain of steps to follow, its answer tracks those steps almost perfectly. When the model writes its own steps, only about half of them actually change the answer, and the ones that matter sit near the end.

Quick summary: A reasoning model is a language model that writes a chain of intermediate steps (a "chain-of-thought") before giving a final answer. People read that chain as an explanation and use it to monitor what the model is doing, which only works if the answer really depends on the steps. We measured that dependence on two open models, DeepSeek-R1-Distill-Qwen 1.5B and 7B, by corrupting one step at a time and watching how much the model's confidence in its answer moves. Two settings, two different answers. When we provide the reasoning chain, the models follow it tightly: corrupting a step that matters collapses the right answer, corrupting an irrelevant step does nothing, and the two are cleanly separated on over 99% of problems. When the models reason for themselves on grade-school math, the dependence is real but thin: only about 43% (1.5B) and 53% (7B) of the steps they write actually move the answer, and that share rises through the chain, so the early steps carry the least. The bigger model is more faithful in both settings. One honest limit: the generated-reasoning result rests on 25 and 32 solved problems, so we report it as suggestive, not settled.

Why this matters

If you read a model's chain-of-thought to understand or audit its answer, you are trusting that the answer follows from the steps. A growing amount of safety work depends on exactly this: watch the reasoning, catch the problem. That trust is only earned if corrupting the reasoning changes the answer. So we tested it directly, on models small enough to study carefully and popular enough to matter. These are distilled reasoners: Alibaba's Qwen math models taught to imitate DeepSeek-R1's reasoning traces, with no reinforcement-learning step. We instrument the released checkpoints and report how they behave. Nothing of ours ships in the model.

How we measure whether a step matters

The idea is simple. Take a reasoning step, change it, and see whether the model's confidence in the original answer drops. We measure confidence as the log-probability the model assigns to that answer, and we always compare against a control: changing a step that should matter, versus changing one that should not. The gap between those two is the effect. A large effect means the step is load-bearing. A near-zero effect means the step is along for the ride.
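A minimal sketch of that arithmetic, assuming you already have the per-token probabilities the model assigns to the answer in each run. The function names and the numbers in the example are illustrative, not taken from the paper's code.

```python
import math

def answer_logprob(token_probs):
    """Log-probability of an answer in nats: sum of log p over its tokens."""
    return sum(math.log(p) for p in token_probs)

def drop(clean_logprob, corrupted_logprob):
    """How far confidence in the original answer falls after an intervention."""
    return clean_logprob - corrupted_logprob

def effect(clean_logprob, corrupted_target_logprob, corrupted_control_logprob):
    """Gap between corrupting a step that should matter and one that should not."""
    target_drop = drop(clean_logprob, corrupted_target_logprob)
    control_drop = drop(clean_logprob, corrupted_control_logprob)
    return target_drop - control_drop

# Illustrative log-probabilities (made up for this sketch).
clean = -0.5              # unmodified chain
corrupted_target = -7.5   # a step that should matter was changed
corrupted_control = -0.6  # a step that should not matter was changed
gap = effect(clean, corrupted_target, corrupted_control)
```

Subtracting the control drop means a model that simply gets rattled by any edit to its context does not register as faithful.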

We run this in two settings, and the contrast is the whole point.

In the provided setting, we write the reasoning ourselves: a short chain that builds a running total, with a few unrelated "aside" steps mixed in. Because we wrote it, we know which steps are load-bearing and which are decoys. In the generated setting, the model writes its own chain-of-thought to solve a grade-school math problem (GSM8K), and we corrupt one of its own steps. We only test problems the model gets right, so there is a real answer to move.
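To make the provided setting concrete, here is a hypothetical chain builder. The step wording, the way asides are placed, and the corruption rule (shifting every number in a step by a fixed offset) are all assumptions made for illustration; the post does not specify the benchmark's exact format.

```python
import random
import re

def build_chain(increments, asides, seed=0):
    """Build a running-total chain with decoy asides mixed in.

    Returns (steps, final_total), where each step is (text, is_load_bearing).
    """
    rng = random.Random(seed)
    total = 0
    steps = []
    for inc in increments:
        total += inc
        steps.append((f"Add {inc}: the running total is now {total}.", True))
    # Drop each aside at a random position; load-bearing steps keep their order.
    for aside in asides:
        steps.insert(rng.randint(0, len(steps)), (aside, False))
    return steps, total

def corrupt(text, offset=5):
    """Assumed corruption rule: shift every integer in the step by `offset`."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + offset), text)

steps, total = build_chain([3, 4, 2], ["Aside: the shop has 12 windows."])
corrupted_chain = [
    (corrupt(text) if i == 0 else text, flag) for i, (text, flag) in enumerate(steps)
]
```

Because the builder records which steps are load-bearing, every measured effect can be checked against a known label.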

When we give the model reasoning, it follows it closely

On the provided chains, the models track the reasoning almost exactly. Corrupting a load-bearing step drops the log-probability of the correct answer by about 7 nats for the 1.5B model and 8.6 for the 7B. Corrupting an irrelevant aside moves it by roughly zero (-0.08 and -0.003). The two are separated on more than 99% of problems. The bigger model follows the chain more tightly than the smaller one, and we can say that with a real test: on the same problems, the 7B's effect is 1.45 nats larger on average, with a 95% confidence interval of [1.02, 1.88] from a paired bootstrap.
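A paired bootstrap of this kind can be sketched as follows. This is a generic percentile bootstrap, assuming one effect per problem per model with both lists aligned on the same problems; the resample count and the percentile method are our choices for the sketch, not details from the paper.

```python
import random

def paired_bootstrap_ci(effects_small, effects_large, n_resamples=10_000,
                        alpha=0.05, seed=0):
    """Mean per-problem difference (large minus small) with a percentile CI."""
    if len(effects_small) != len(effects_large):
        raise ValueError("both models must be scored on the same problems")
    diffs = [b - a for a, b in zip(effects_small, effects_large)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, low, high
```

Resampling the per-problem differences, rather than each model's effects separately, is what makes the comparison paired: problem difficulty cancels out within each pair.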

One caution we keep in view. Our provided chain restates the running total at each step, so a model could score well partly by tracking the latest stated number rather than reasoning about it. The near-zero effect of corrupting the decoy steps shows the models are at least selective about which steps they read, but this setting rewards careful value-tracking, and we do not claim it is a pure test of reasoning.

When the model writes its own reasoning, only half of it counts

Self-written reasoning is a weaker constraint on the answer. Corrupting a generated step does lower the model's confidence in its answer on average, and the effect clears zero (problem-clustered 95% intervals [1.23, 2.11] for the 1.5B and [1.55, 2.39] for the 7B). So the generated chain is used. It is just not all used. At our one-nat bar for "this step matters," and pooling every step in the chain regardless of where it sits, only 43% of the 1.5B's steps and 53% of the 7B's qualify, and the median step sits below the bar for the smaller model. About 30% of generated steps have a slightly negative effect: corrupting them does not lower the model's confidence in its answer at all. That chain-wide figure also hides a strong gradient by position in the chain, which the rest of this section pulls apart.
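The pooled share and a problem-clustered interval can be sketched like this. The input layout (one list of step effects per problem), the choice of `>=` at the one-nat bar, and the percentile bootstrap are assumptions for illustration.

```python
import random
from statistics import median

def pooled_summary(effects_by_problem, bar=1.0):
    """Pool every step from every problem and summarise against the bar."""
    pooled = [e for steps in effects_by_problem for e in steps]
    return {
        "share_load_bearing": sum(e >= bar for e in pooled) / len(pooled),
        "share_negative": sum(e < 0 for e in pooled) / len(pooled),
        "median": median(pooled),
    }

def clustered_bootstrap_ci(effects_by_problem, n_resamples=5000,
                           alpha=0.05, seed=0):
    """Percentile CI for the mean effect, resampling whole problems."""
    rng = random.Random(seed)
    k = len(effects_by_problem)
    means = []
    for _ in range(n_resamples):
        # Steps from one problem are not independent, so draw problems, not steps.
        drawn = [effects_by_problem[rng.randrange(k)] for _ in range(k)]
        pooled = [e for steps in drawn for e in steps]
        means.append(sum(pooled) / len(pooled))
    means.sort()
    return means[int((alpha / 2) * n_resamples)], means[int((1 - alpha / 2) * n_resamples) - 1]
```

Clustering by problem matters with only a few dozen solved problems: treating every step as an independent sample would make the interval look narrower than the data supports.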

Where the load-bearing steps sit is the clearer story. We split each model's reasoning into thirds and measured the effect in each. It rises steadily from the start of the chain to the end. The diagram below shows the 1.5B; the 7B traces the same shape one step higher.

Figure: DeepSeek-R1-Distill-Qwen-1.5B, self-generated GSM8K reasoning. Both the average effect of corrupting a step and the share of steps that clear the load-bearing bar rise from the start of the chain to the end.

The early third of the 1.5B's reasoning barely moves the answer (mean effect 0.47, with 19% of steps load-bearing). In the middle third, 34% of steps are load-bearing, and the late third carries most of the weight (mean effect 2.74, with 64% load-bearing). These bins are what the 43% chain-wide figure pools over: most early steps do nothing, most late steps matter, and averaging across all of them lands just under halfway. The headline number is held down by the early steps, not spread evenly across the chain. This is not just an artifact of comparing different problems: inside a single problem, a step's position and its effect are positively correlated (mean correlation 0.42, 95% interval [0.26, 0.57], positive in 83% of problems). Later steps do constrain the answer more. They are also closer to the answer and more likely to repeat a number that ends up in it, so we report the gradient as a robust pattern, not as proof that early reasoning is purely decorative.
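Both views of the gradient, the binned one and the within-problem one, can be sketched in a few lines. The binning rule and the use of a Pearson correlation are assumptions; the post does not say which correlation or which bin boundaries were used.

```python
from statistics import mean

def by_thirds(effects_by_problem, bar=1.0):
    """Mean effect and load-bearing share for the early, middle and late thirds."""
    bins = {0: [], 1: [], 2: []}
    for steps in effects_by_problem:
        for i, e in enumerate(steps):
            bins[3 * i // len(steps)].append(e)  # 0 = early, 1 = middle, 2 = late
    return {k: (mean(v), sum(e >= bar for e in v) / len(v))
            for k, v in bins.items() if v}

def pearson(xs, ys):
    """Plain Pearson correlation; returns None if either side has no variance."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5

def position_effect_correlations(effects_by_problem):
    """One correlation per problem between step index and step effect."""
    out = []
    for steps in effects_by_problem:
        if len(steps) >= 3:
            r = pearson(list(range(len(steps))), steps)
            if r is not None:
                out.append(r)
    return out
```

Computing the correlation inside each problem and then averaging is what rules out the between-problem explanation: a long, easy problem cannot drag the trend around on its own.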

How you measure it changes the answer

We also intervened by deleting a step instead of corrupting it. On the provided chain, deletion is a much weaker detector of load-bearing steps than corruption (it recovers the ground-truth steps at 0.66 area-under-curve, versus 0.995 for corruption). The reason is specific to our task: when a step is deleted, a later step still restates the running total, so the model recovers. The lesson is general, though. A single faithfulness number depends on how you poke the model, so the intervention has to be reported alongside the result.
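One way to score an intervention as a detector, assuming "area-under-curve" here means ROC AUC: take each step's measured effect as a score, take the known load-bearing labels as ground truth, and compute the probability that a load-bearing step outscores a decoy. The scores in the example are made up.

```python
def auc(scores, labels):
    """ROC AUC via pairwise comparison; ties count as half a win."""
    pos = [s for s, is_lb in zip(scores, labels) if is_lb]
    neg = [s for s, is_lb in zip(scores, labels) if not is_lb]
    if not pos or not neg:
        raise ValueError("need at least one load-bearing step and one decoy")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [True, True, False, False]
# Illustrative effects only: an intervention that separates the two groups
# cleanly, and one that ranks a load-bearing step below both decoys.
separating = auc([7.1, 6.8, 0.0, -0.1], labels)
muddled = auc([0.5, 0.1, 0.3, 0.2], labels)
```

An AUC of 1.0 means every load-bearing step scores above every decoy, and 0.5 means the ranking is no better than chance.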

Where this does not reach

We also set out to localize the computation inside the network, by patching activations between a clean and a corrupted run. That needs a task the model can solve and a chain with no restated values to leak the answer. Our scaffolded chain has exactly that redundancy. An unscaffolded version removes it, but neither model can solve it without first writing a chain-of-thought, so there is nothing clean to localize. We report this as a negative result. The clean place to do that localization is the self-generated setting, which is where we would take this next.

The honest ceiling: we studied two checkpoints from one model family, on arithmetic and grade-school math. The generated-reasoning numbers rest on 25 and 32 solved problems, so the size comparison there is suggestive rather than established, and we have not yet tested whether the same sparsity shows up on knowledge-heavy reasoning. GSM8K is also very likely in these models' training data, which is why the synthetic chain, which they have never seen, anchors the ground-truth claims.

Takeaways

A distilled reasoning model can track reasoning you hand it far more tightly than reasoning it writes itself. Treat a generated chain-of-thought as a partial explanation, not a complete one.
In self-written reasoning, the steps are not equal. The later steps carry most of the causal weight; the earliest steps carry the least.
Any single faithfulness score depends on the intervention. Corrupting a step and deleting a step give different verdicts, so report which you used.
Bigger helped. The 7B model was more faithful than the 1.5B in both settings.

The full method, the controls, and every number are in the paper: Whose Reasoning Is It? Distilled Reasoning Models Are Faithful to Provided Chains but Only Sparsely to Their Own. The intervention code and the synthetic-chain benchmark are available from the authors on request.
