
3 posts tagged with "reinforcement-learning"


Scaling a protein model is what lets RL beat trial-and-error protein design

· 8 min read

Designing a protein means searching a huge space of options while each real test costs money. A popular approach is to fine-tune a big protein AI with reinforcement learning. Our lab ran a fair contest against plain trial-and-error. The AI only wins once you make the model bigger, and the win comes from giving the search a better starting point, not from the model knowing which proteins are good.

When an AI searches for the equation behind your data, one dial steers it

· 10 min read

Our lab trained a model to rediscover physics equations from raw data. We wanted to know which training dial actually controls how hard it searches. The simplicity penalty does. A popular clipping knob barely does.

Picture a table of numbers. One column is how far a pendulum swings, another is how long each swing takes, and a third is the answer you care about. Somewhere behind those numbers is a short formula that produced them. A human physicist might stare at the table and guess that the swing time grows with the square root of the pendulum's length. The goal of this project was to get a machine to make that kind of guess on its own: read the data, propose a compact equation, and check whether it matches the law that really generated the numbers.
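To make the "propose an equation, then check it" idea concrete, here is a minimal sketch. It assumes the hidden law is the standard small-angle pendulum formula, T = 2π·sqrt(L/g), and that the table holds pendulum lengths and swing times. The data is synthetic, and the function names are hypothetical. This is not the model or dataset from the post, only a hand-rolled power-law fit.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed for the synthetic data)

def pendulum_period(length_m):
    """Small-angle pendulum period, standing in for the hidden law."""
    return 2 * math.pi * math.sqrt(length_m / G)

def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**k in log-log space; returns (c, k)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    k = sxy / sxx                       # slope in log-log space = exponent
    c = math.exp(mean_y - k * mean_x)   # intercept in log-log space = log(c)
    return c, k

# Synthetic "table of numbers": lengths from 0.1 m to 2.0 m and their periods.
lengths = [0.1 * i for i in range(1, 21)]
periods = [pendulum_period(length) for length in lengths]

c, k = fit_power_law(lengths, periods)
print(f"fitted exponent k = {k:.3f}, prefactor c = {c:.3f}")
```

On noise-free data the fitted exponent comes out at essentially 0.5, which is the "square root of the length" guess a physicist would make. Real measurements would add noise, and an equation-discovery model has to search over many functional forms, not just power laws.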

Training an AI to be correct collapses its variety

· 11 min read

We trained language models to invent Sokoban puzzles and paid them only for puzzles that are solvable and of the right difficulty, never for being different from each other. As the reward went up, the models stopped inventing and settled on a single puzzle that they repeated over and over. The good news: the trade-off is measurable, and you can control it.

Quick summary: Many modern AI models are trained to be correct with a method called reinforcement learning from a verifiable reward (RLVR): the model tries something, an automatic checker says right or wrong, and the model is nudged toward the right answers. We used a clean test case, generating solvable Sokoban puzzles, to ask what this does to the variety of what a model produces. The answer is a sharp collapse. As the model gets better at earning the reward, the fraction of its puzzles that differ from each other drops from about 1.0 (nearly all different) to under 0.05 (nearly all the same), often within 50 to 150 training steps. This held across three versions of the training method, three reruns from different random starts, and two model sizes. The collapse is also controllable. You can stop training at the "knee" of the trade-off curve and keep most of the variety, larger models trade off more gently, and a reward that pays a small bonus for new shapes removes exact copying. That bonus has a real limit: it only weakly restores genuine variety at the sizes we tested. And the specific clipping trick people argue about online (PPO versus DAPO versus DPPO) turned out not to be a reliable lever here.
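The variety number quoted above can be illustrated with a minimal sketch. It assumes the metric is simply the count of unique puzzles divided by the total count in a batch of samples, with each puzzle represented as a string. The post's exact definition may differ, and `distinct_fraction` is a hypothetical name.

```python
def distinct_fraction(puzzles):
    """Share of a batch that is unique: number of distinct puzzles / batch size.

    Assumption: two puzzles count as "the same" only if their string
    representations match exactly.
    """
    if not puzzles:
        raise ValueError("need at least one puzzle")
    return len(set(puzzles)) / len(puzzles)

# Early in training (illustrative placeholder data): every sample is different.
early_batch = [f"puzzle-{i}" for i in range(40)]

# Late in training (illustrative placeholder data): one puzzle repeated 40 times.
late_batch = ["puzzle-0"] * 40

print(distinct_fraction(early_batch))  # 1.0
print(distinct_fraction(late_batch))   # 0.025
```

Tracking this number alongside the reward during training is one way to spot the "knee" by eye: the point where reward is still climbing but the distinct fraction has started to fall sharply.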