← All papers
Reward Maximization Collapses Generative Diversity: Characterizing and Controlling the Trade-off in Verifiable Procedural Generation
Verifiable-reward RL (RLVR) for procedural Sokoban-level generation triggers a sharp reward↔diversity phase transition: the model mode-collapses to one or two level templates (distinct-valid fraction ≈1.0 → <0.05) across three trust-region objectives (PPO, DAPO, DPPO), seeds, and scales. The trade-off is controllable — passively via early-stopping at the Pareto knee, actively via a novelty-bonus reward — but which clipping objective is used is not a robust lever for diversity.