
Scaling a protein model is what lets RL beat trial-and-error protein design


Designing a protein means searching a huge space of options while each real test costs money. A popular approach is to fine-tune a big protein AI with reinforcement learning. Our lab ran a fair contest against plain trial-and-error. The AI only wins once you make the model bigger, and the win comes from giving the search a better starting point, not from the model knowing which proteins are good.

Quick summary. A protein is a string of building blocks called amino acids. To design a good one, you try candidates and measure how well each works, a number we call its "fitness". Each measurement is expensive, so you want to find good candidates in as few tries as possible. We compared two search methods on a well-studied protein called GB1, giving both the exact same number of tries: reinforcement learning (RL) that fine-tunes a protein language model, and a simple classical method (directed evolution, the "keep mutating the best ones" approach). Two findings. First, the protein model's built-in hunch about which proteins are good is almost useless here: its guess barely correlates with real fitness (0.04 for the small model and 0.15 for the bigger one, where 1.0 would be a perfect match and 0 is none). Second, with a small model RL only ties trial-and-error, but with a bigger model RL beats it by about 10% when tries are scarce. The surprising part is the reason. The bigger model still cannot tell good proteins from bad ones. It helps the search only by giving it a better starting point.

Cheaper search saves real lab time and money

Lots of important work is a search under a tight budget: find a protein that binds, a molecule that is stable, a sequence that does a job, while each lab test is slow and costly. A method that finds good candidates in fewer tests saves real time and money. So it matters whether the expensive tool (RL on a large protein model) actually beats the cheap, old tool (trial-and-error), or just looks good against weak comparisons.

Most papers that report "RL wins" compare against weak opponents, and often grade candidates with a scoring model that the optimizer can quietly exploit. We wanted an honest answer, so we designed a test that cannot be gamed.

How we set up the comparison

We used GB1, a small protein where researchers have already measured the fitness of almost every possible variant: about 149,000 of the 160,000 combinations at four key positions. Because the answer key is complete, we can grade any candidate exactly, with no scoring model to fool. We gave every method the same budget of unique tries, started them from the same weak starting points, and repeated each run five times, so a single lucky seed cannot drive the result.
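As a rough illustration of that setup, here is a minimal sketch of a lookup-based grader that charges one try per unique variant. The class name `BudgetedOracle`, the `fitness_table` dictionary, and the choice to score unmeasured variants as 0.0 are all assumptions made for this example, not a description of our actual code.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


class BudgetedOracle:
    """Exact fitness lookup that charges one try per *unique* variant."""

    def __init__(self, fitness_table, budget):
        self.fitness_table = fitness_table  # dict: 4-letter variant -> measured fitness
        self.budget = budget                # maximum number of unique variants to grade
        self.seen = {}                      # variants already graded in this run

    def query(self, variant):
        if variant in self.seen:
            # Re-asking about a known variant costs nothing.
            return self.seen[variant]
        if len(self.seen) >= self.budget:
            raise RuntimeError("query budget exhausted")
        # Assumption for this sketch: unmeasured variants score 0.0.
        score = self.fitness_table.get(variant, 0.0)
        self.seen[variant] = score
        return score

    @property
    def tries_used(self):
        return len(self.seen)


# Size of the full search space: 20 choices at each of 4 positions.
n_variants = sum(1 for _ in product(AMINO_ACIDS, repeat=4))
```

Counting unique variants rather than raw calls is what lets two very different optimizers be compared on the same budget.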

Our strong classical baseline is simulated annealing, a standard form of directed evolution: start somewhere, make small random changes, keep the good ones, and occasionally accept a worse one to escape dead ends. On GB1 it is very strong. It essentially solves the landscape, reaching high scores reliably within about 1,000 tries and the single best variant by around 5,000.
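The loop described above can be sketched in a few lines. This is a generic simulated-annealing sketch, not our tuned baseline: the geometric cooling schedule, the temperature values, and the single-position mutation move are assumptions chosen for illustration, and the sketch counts steps rather than unique tries.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simulated_annealing(fitness, start, n_steps, t_start=1.0, t_end=0.01, seed=0):
    """Maximize `fitness` over fixed-length sequences by single-site mutation."""
    rng = random.Random(seed)
    current, current_fit = start, fitness(start)
    best, best_fit = current, current_fit
    for step in range(n_steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / max(1, n_steps - 1))
        # Small random change: resample one position.
        pos = rng.randrange(len(current))
        candidate = current[:pos] + rng.choice(AMINO_ACIDS) + current[pos + 1:]
        cand_fit = fitness(candidate)
        delta = cand_fit - current_fit
        # Always keep improvements; sometimes accept a worse variant
        # so the search can escape dead ends.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_fit = candidate, cand_fit
        if current_fit > best_fit:
            best, best_fit = current, current_fit
    return best, best_fit
```

In a budgeted comparison, `fitness` would be a grader that tracks unique variants, so repeated proposals do not consume extra tries.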

The protein model's hunch is nearly useless

Before any RL, we checked a simple thing: does the protein language model already know which GB1 variants are good? A protein language model is an AI trained on millions of natural protein sequences, so it has a sense of what looks "protein-like." We scored every variant by that sense and compared it to real fitness.

The match is almost nothing. The correlation is 0.04 for the small (35M-parameter) model and 0.15 for the bigger (150M) one, where 1.0 would be perfect and 0 is no relationship. The variants the model ranks highest have near-zero real fitness. So just trusting the model, with no search, is weak. The model's prior, its built-in guess before any search, does not already know the answer.
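For readers who want to run the same kind of check on their own landscape, here is a minimal sketch of a rank correlation between model scores and measured fitness. Using Spearman's rank correlation is an assumption made for this example (the post only says "correlation"), and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(model_scores, fitness):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(model_scores), average_ranks(fitness))
```

A value near 1.0 would mean the model orders variants the way the lab measurements do; a value near 0 means its ordering carries almost no information about fitness.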

Small model: RL only ties

Now the contest. We fine-tuned the protein model with GRPO, a reinforcement-learning method that rewards the model for proposing higher-fitness variants. With the small (35M) model, carefully tuned, RL matches annealing but does not beat it. At a budget of 1,000 tries it reaches a top score of 4.49 versus annealing's 4.34. That is a tie within the run-to-run noise. It also fails a target we set in advance: beat the strongest classical method by at least 10%. It never does, at any budget, with the small model.
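The core idea behind GRPO is to score each proposal relative to the other proposals sampled in the same group, instead of using a separate value network. The sketch below shows only that normalization step, in the form GRPO is commonly described; it assumes a plain list of per-sample rewards and does not describe how our training runs were configured.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: fitness values for four variants proposed in one group.
advantages = group_relative_advantages([0.2, 1.0, 3.5, 0.3])
```

Variants that score above the group average get a positive advantage and are made more likely; those below average are made less likely. If every proposal in a group earns the same reward, all advantages are zero and the group contributes no learning signal.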

Bigger model: RL wins

Swapping in the bigger (150M) model changes the outcome: the same contest flips from a tie to a win.

Figure (interactive in the original): top score at a budget of 1,000 tries with the small (35M) model. Trial and error (annealing): 4.34. RL on the protein model: 4.49. Same problem, same query budget; scaling the protein model is what turns the tie into a win.

At 1,000 tries the bigger model's RL reaches 4.77 versus annealing's 4.34, about 10% better for the same number of tests, and the gap holds up across five repeats. Scaling the protein model is what lets RL overtake a strong classical optimizer.
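The percentages are simple ratios of top scores. As a worked check using only the rounded scores quoted in this post (so the last digit is not meaningful), with `relative_gain` as a hypothetical helper name:

```python
def relative_gain(method_score, baseline_score):
    """Fractional improvement of a method's top score over the baseline's."""
    return (method_score - baseline_score) / baseline_score


# Top scores at a budget of 1,000 tries, as quoted in the text.
small_model_gain = relative_gain(4.49, 4.34)  # RL with the 35M model vs annealing
large_model_gain = relative_gain(4.77, 4.34)  # RL with the 150M model vs annealing

print(f"35M:  {small_model_gain:.1%}")
print(f"150M: {large_model_gain:.1%}")
```

With these rounded inputs the small model comes out at roughly 3.5% and the bigger model at roughly 9.9%, which is the "about 10%" quoted above.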

The win lives where tries are scarce

The advantage is not uniform. It is largest exactly where it matters, when you can only afford a few tests, and it shrinks as the budget grows and both methods run out of room near the best possible protein.

Figure (interactive in the original): top score as the budget sweeps from few tries to many, for trial and error and for RL on the 150M model. At a budget of 100 tries, trial and error reaches 0.93 and RL reaches 0.36, so RL is behind (too few tries). RL’s lead is largest when tries are scarce (around 1,000), then shrinks to a tie as both methods run out of room near the best possible protein.

With very few tries (100), RL has not gotten going and loses. Between 1,000 and 3,000 tries RL leads, by roughly 10%, 7%, 5%, and 2% as the budget grows. By 5,000 tries and beyond, annealing has caught up and the two tie, because both are bumping against the global best. The useful regime is the sample-limited one, which is also the regime that matters when each test is a real experiment.

What the bigger model actually buys you

The bigger model wins, but it is still a bad judge of fitness: its ranking of variants remains poor (correlation 0.15, top picks with near-zero real fitness). So the gain does not come from the bigger model knowing which proteins are good. It comes from giving the search a better starting point, even though the larger model still cannot tell which proteins work. We did not isolate this mechanism fully, so we report it as the most likely reading rather than a proven cause.

The things that did not work (the useful part)

Getting RL to work at all took care, and the failures are informative.

  • RL collapses by default. Left alone, the policy quickly latches onto a few variants it has already found and stops exploring. In one setting it spent only 19 of a 1,000-try budget before getting stuck. The fix was to stop rewarding it for re-proposing variants it already knows, add a small bonus for staying varied, and use a low learning rate. Of these, the learning rate mattered most.
  • The model's hunch did not help on its own. Ranking variants by the protein model, with no search, stays weak at both sizes. The prior is not a shortcut here.
  • At the budget we pre-committed to (5,000 tries with the small model), there is no win. We report that as a clean negative result, not a footnote.
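The anti-collapse fix from the first bullet can be sketched as a reward-shaping function. Everything specific here is an assumption for illustration: the function name `shaped_rewards`, the bonus size, and the choice to measure "staying varied" as the fraction of distinct variants in a batch. Our actual reward terms may differ.

```python
def shaped_rewards(batch, fitness, seen, diversity_bonus=0.1):
    """Reward a batch of proposals; re-proposed variants earn nothing.

    batch:   list of proposed variants (strings)
    fitness: callable returning the measured fitness of a variant
    seen:    set of variants graded in earlier batches (updated in place)
    """
    # Assumed diversity term: larger when the batch contains more distinct variants.
    bonus = diversity_bonus * len(set(batch)) / len(batch)
    rewards = []
    new_in_batch = set()
    for variant in batch:
        if variant in seen or variant in new_in_batch:
            # No reward for repeating something the search already knows.
            rewards.append(0.0)
        else:
            rewards.append(fitness(variant) + bonus)
            new_in_batch.add(variant)
    seen.update(new_in_batch)
    return rewards
```

Because repeats earn zero, the policy gains nothing by latching onto a handful of known variants. The learning rate, which mattered most in our runs, is a training setting and is not shown in this sketch.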

Where this does not hold

This is one small, fully-measured protein landscape and a two-point jump in model size (35M to 150M). Two points do not prove a scaling law, and a second protein could behave differently. We also matched the budget on the expensive thing: the number of lab-style tests. We did not match raw computation. On GB1 a "test" is a cheap table lookup. So RL's edge in tries is not a speed win here. It would matter in a real setting where a single test (a wet-lab assay) costs far more than running the model. Because this win sits at a smaller budget and larger model than our pre-registered target, we treat the scaling result as exploratory, a strong signal worth confirming on more landscapes.

Takeaways

  • Test RL for design against a strong classical baseline at a matched query budget. On GB1, simple trial-and-error nearly solves the problem, so a weak baseline would have made RL look better than it is.
  • A protein model's "hunch" is not a fitness oracle. Here it barely correlates with real fitness, so do not assume the prior carries the answer.
  • Model scale can change the verdict. The same RL method ties at 35M and wins at 150M, and the benefit seems to come from a better starting point, not better fitness knowledge.
  • Watch for collapse. Reward the search for finding new good variants, not for repeating ones it already has.

The full paper, "Does the Prior Pay Off?", covers the statistics, the per-budget breakdown, and the limitations in detail. GB1 data is public via the FLIP benchmark; our code and configurations are available on request.