Training an AI to be correct collapses its variety

June 22, 2026 · 11 min read

Ali Asaria

Tony Salomone

Deep Gandhi

We trained language models to invent Sokoban puzzles and paid them only for puzzles that are solvable and of the right difficulty, never for being different from each other. As the reward went up, the models stopped inventing and settled on a single puzzle they repeated over and over. The good news: the trade-off is measurable, and you can control it.

Quick summary: Many modern AI models are trained to be correct with a method called reinforcement learning from a verifiable reward (RLVR): the model tries something, an automatic checker says right or wrong, and the model is nudged toward the right answers. We used a clean test case, generating solvable Sokoban puzzles, to ask what this does to the variety of what a model produces. The answer is a sharp collapse. As the model gets better at earning the reward, the fraction of its puzzles that differ from each other drops from about 1.0 (nearly all different) to under 0.05 (nearly all the same), often within 50 to 150 training steps. This held across three versions of the training method, three reruns from different random starts, and two model sizes. The collapse is also controllable. You can stop training at the "knee" of the trade-off curve and keep most of the variety, larger models trade off more gently, and a reward that pays a small bonus for new shapes removes exact copying. That bonus has a real limit: it only weakly restores genuine variety at the sizes we tested. And the specific clipping trick people argue about online (PPO versus DAPO versus DPPO) turned out not to be a reliable lever here.
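
To make the variety number concrete, here is a minimal sketch of one way to compute "the fraction of puzzles that differ from each other": count the distinct puzzles in a batch and divide by the batch size. This is an assumption for illustration, not necessarily the exact metric used in the experiments. The function name `distinct_fraction`, the string encoding of puzzles, and both toy batches are hypothetical.

```python
from typing import Hashable, Iterable


def distinct_fraction(samples: Iterable[Hashable]) -> float:
    """Return (number of distinct samples) / (total samples).

    Assumption: two puzzles count as "the same" only if they match exactly.
    """
    items = list(samples)
    if not items:
        raise ValueError("need at least one sample")
    return len(set(items)) / len(items)


# Hypothetical toy batches; each puzzle is just a string here.
early_batch = ["#@$.#", "#.$@#", "#@ $.#", "#.$ @#"]  # four different puzzles
late_batch = ["#@$.#"] * 19 + ["#.$@#"]               # one puzzle dominates

print(distinct_fraction(early_batch))  # 1.0
print(distinct_fraction(late_batch))   # 0.1
```

Under this reading, a value near 1.0 means almost every sampled puzzle is different, and a value under 0.05 means fewer than one in twenty samples is new. Exact matching only catches literal copies, which is why a bonus that removes exact copying can still leave genuine variety low.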