Everyone assumes bigger AI models win. We built a 34,000-parameter earthquake detector that beats a model 8× its size.
Everyone assumes bigger AI models win. We built an earthquake detector with 34,000 parameters, hundreds of times smaller than the models people usually reach for, and it beats a standard deep model 8 times its size. What mattered wasn't scale. It was one change to how it learns.
Quick summary: When an earthquake happens, a sensor records two bumps: a small early one (the P wave) and a bigger later one (the S wave). Software has to mark the exact instant each one arrives. The popular tools for this are large neural networks. We trained a tiny one instead, with about 34,000 parameters (the values a network learns), and tested it carefully so it could never peek at the answers. It scored 0.76 out of 1.0, where higher is better. A standard model called PhaseNet, about 8 times bigger, scored 0.64 on the exact same test. The deciding factor was not size. It was a single change to how we scored the model while it learned. We also tested a fashionable shortcut, self-supervised pretraining, and found it gave no real benefit here.
Background on seismic phase picking
A seismometer is an instrument that records ground shaking. When an earthquake hits, two kinds of waves reach it a few seconds apart. The P wave arrives first: a small, quick jolt. The S wave comes next, and it's the bigger shake you'd actually feel.
A single sensor's recording of a distant earthquake. The ground is quiet, then the quick P wave arrives, the larger S wave follows, and the surface waves shake hardest of all. The model's job is to mark the exact instant the P and S waves begin.
The exact moment each wave arrives is the single most useful number in earthquake monitoring. From those two arrival times, scientists work out where the quake was, how big it was, and whether to send an alert. Marking those two moments on a wiggly recording is called phase picking.
People used to do this by hand with written rules. Now they use neural networks, programs that learn the pattern from labeled examples. Two of them, PhaseNet and EQTransformer, are the go-to tools, and they work well. But two habits in how they're built and graded are worth questioning.
First, they're big. A network's size is counted in parameters, the numbers it tunes while learning. The standard pickers carry hundreds of thousands of them, and nobody really asks whether they need to.
Second, they're usually graded on a test that's too easy by accident. One earthquake gets recorded by many sensors, and each recording is chopped into overlapping clips. If you shuffle all those clips and split them at random into a practice pile and a test pile, near-copies of the same quake end up in both piles. The model has effectively already seen the test answers during practice. That's called data leakage, and it quietly inflates the scores. A model can look excellent simply because it memorized quakes it was later "tested" on.
We wanted to drop both habits at once: build the model small, and grade it honestly.
A fair test, and a small model
We used a public collection called STEAD: about 1.2 million earthquake recordings, each already marked by a human expert with the true arrival times. The crucial step was the split: we made sure every earthquake goes entirely into either the practice set or the test set, never both. No sneaky near-copies cross over.
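One way to build a split like this is to shuffle earthquakes rather than clips. The sketch below is illustrative only and is not the code behind the results here: the function name `split_by_event`, the `test_fraction` value, and the toy event IDs are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_event(clip_event_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no event ID lands in both piles."""
    # Group clip indices by the earthquake they were cut from.
    clips_by_event = defaultdict(list)
    for idx, event_id in enumerate(clip_event_ids):
        clips_by_event[event_id].append(idx)

    # Shuffle the events, not the clips, then cut the event list once.
    events = sorted(clips_by_event)
    random.Random(seed).shuffle(events)
    n_test = max(1, round(len(events) * test_fraction))
    test_events = set(events[:n_test])

    train_idx, test_idx = [], []
    for event_id, idxs in clips_by_event.items():
        target = test_idx if event_id in test_events else train_idx
        target.extend(idxs)
    return sorted(train_idx), sorted(test_idx)


# Hypothetical data: ten clips cut from five quakes.
event_ids = ["q1", "q1", "q1", "q2", "q2", "q3", "q4", "q4", "q4", "q5"]
train_idx, test_idx = split_by_event(event_ids, test_fraction=0.4)
train_events = {event_ids[i] for i in train_idx}
test_events = {event_ids[i] for i in test_idx}
print("shared events:", train_events & test_events)
```

Because whole events are assigned to one pile, every clip of a given quake travels together. A plain random shuffle of clips offers no such guarantee.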
We also graded strictly. A pick only counts as correct if it lands within a tenth of a second of the true P arrival, or two tenths of a second for the slower S wave. The score we report throughout is an F1 score on a 0-to-1 scale, where 1 is perfect. It rewards the model for catching the real P and S onsets at the right instant while not crying wolf on quiet stretches.
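The grading rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the exact scorer used here: it assumes at most one true arrival and one predicted pick per clip for each phase, and it counts a pick outside the tolerance as both a false alarm and a miss. The names `count_outcomes` and `f1_score` and the toy arrival times are made up for the example.

```python
P_TOL_S = 0.1  # tolerance for P picks, in seconds (from the text)
S_TOL_S = 0.2  # tolerance for S picks, in seconds (from the text)


def count_outcomes(true_times, pred_times, tol):
    """Count true positives, false positives and false negatives.

    Each list holds one entry per clip: an arrival time in seconds,
    or None when there is no arrival (or no pick) in that clip.
    """
    tp = fp = fn = 0
    for true_t, pred_t in zip(true_times, pred_times):
        if true_t is not None and pred_t is not None and abs(pred_t - true_t) <= tol:
            tp += 1
        else:
            if pred_t is not None:
                fp += 1  # a pick with nothing real close enough
            if true_t is not None:
                fn += 1  # a real arrival that was not caught in time
    return tp, fp, fn


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical P arrivals for four clips.
true_p = [5.00, 7.20, None, 3.10]
pred_p = [5.04, 7.45, 2.00, None]
tp, fp, fn = count_outcomes(true_p, pred_p, P_TOL_S)
p_f1 = f1_score(tp, fp, fn)
print(tp, fp, fn, round(p_f1, 3))
```

In the toy data only the first clip counts as a hit: the second pick is 0.25 s late, the third fires on a quiet clip, and the fourth clip's arrival is missed. The same two functions would be called with `S_TOL_S` for the S wave.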
Our model is a small network with 33,610 parameters: tiny by today's standards, and small enough to run on an ordinary computer with no special hardware.
The one change that mattered
Here's where the story turns, and it's the whole point.
The first attempt failed in a strange way. The model learned to find the S wave but completely ignored the P wave. Its P score was 0.00, a total miss.
The reason is almost funny once you see it. In any recording, nearly every single moment is just background noise: more than 99 moments out of 100. When the model trains, it earns the most reward by labeling all that quiet correctly. So it took the lazy path: it labeled everything as noise, scored well on the 99%, and never bothered to learn the faint little P bump.
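A toy calculation shows how lopsided the numbers are. The clip length, sampling rate and label widths below are assumptions chosen for illustration; only the "more than 99 in 100" proportion comes from the text.

```python
import numpy as np

NOISE, P, S = 0, 1, 2

# Assumed layout: a 60-second clip sampled 100 times per second.
n_samples = 6000
labels = np.full(n_samples, NOISE)
labels[1000:1020] = P  # assumed: a 0.2 s stretch marked as the P onset
labels[1800:1820] = S  # assumed: a 0.2 s stretch marked as the S onset

# The "lazy" model: call every single moment noise.
all_noise = np.full(n_samples, NOISE)

accuracy = np.mean(all_noise == labels)
p_recall = np.mean(all_noise[labels == P] == P)
print(f"per-moment accuracy: {accuracy:.4f}, P moments found: {p_recall:.2f}")
```

Under these assumptions the lazy labeler is right on more than 99% of moments while finding none of the P moments, which is the trap described above.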
The fix wasn't a bigger network. It was changing how we score the model during training. We told it that catching the rare P and S moments is worth far more than labeling the common quiet, an approach called a foreground-weighted loss. Suddenly the model had a reason to care about the 1% that actually matters.
That one change pulled the score on our tuning checks from 0.40 up to 0.78, and the P score from 0.00 up to 0.78. The S wave, louder and easier, was already fine. Same tiny model, same data, just a different definition of what "doing well" means. Careful supervision, not scale, is what made it work.
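To see why re-weighting changes the model's incentives, here is a minimal sketch of one common form of foreground-weighted loss: a class-weighted cross-entropy. It is an illustration, not necessarily the exact loss used here. The weight of 50, the two imaginary models and their output probabilities are all assumptions.

```python
import numpy as np

NOISE, P, S = 0, 1, 2


def weighted_cross_entropy(probs, labels, class_weights):
    """Cross-entropy where each moment is scaled by the weight of its true class."""
    picked = probs[np.arange(len(labels)), labels]  # probability given to the true class
    per_sample = -np.log(picked + 1e-12)
    weights = class_weights[labels]
    return float(np.sum(weights * per_sample) / np.sum(weights))


def constant_probs(n, noise_prob):
    """Same (noise, P, S) probabilities at every moment."""
    rest = (1.0 - noise_prob) / 2.0
    return np.tile([noise_prob, rest, rest], (n, 1))


# Assumed toy clip: 6000 moments, 20 of them P and 20 of them S.
n = 6000
labels = np.full(n, NOISE)
labels[1000:1020] = P
labels[1800:1820] = S

# "Lazy" model: very sure everything is noise.
lazy = constant_probs(n, 0.98)
# "Attentive" model: a bit less sure about noise, but it spots the onsets.
attentive = constant_probs(n, 0.95)
attentive[labels == P] = [0.1, 0.8, 0.1]
attentive[labels == S] = [0.1, 0.1, 0.8]

plain = np.array([1.0, 1.0, 1.0])         # every moment counts the same
foreground = np.array([1.0, 50.0, 50.0])  # assumed: P and S moments count 50x

plain_lazy = weighted_cross_entropy(lazy, labels, plain)
plain_attentive = weighted_cross_entropy(attentive, labels, plain)
weighted_lazy = weighted_cross_entropy(lazy, labels, foreground)
weighted_attentive = weighted_cross_entropy(attentive, labels, foreground)
print(plain_lazy, plain_attentive, weighted_lazy, weighted_attentive)
```

In this toy setup the plain loss slightly prefers the lazy model (lower is better), because being extra confident on thousands of quiet moments outweighs missing forty onsets. With the foreground weights the ranking flips, and the attentive model wins by a wide margin.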
The result: small beats big
The 0.78 above came from the tuning checks we used while building the model. The real test is data the model has never touched. On those held-out recordings the score settles slightly to 0.76 (0.78 for P, 0.75 for S), essentially unchanged, which is what you want to see: the model wasn't just memorizing its practice set. We then ran the standard PhaseNet through the exact same test (same split, same grading) for a true side-by-side match. PhaseNet scored 0.64, with about 270,000 parameters, roughly 8 times more than ours.
So on this job, the smaller model is the more accurate one. Two honest caveats: we scored PhaseNet through our own pipeline rather than retraining it on our split, so its number might shift with fresh training; and our own result comes from a single training run. The point survives both. A picker small enough to run almost anywhere gives up nothing in accuracy to be that small.
Where it breaks
The model has real limits, and they're worth stating plainly.
How well it does depends mostly on how far away the quake is. For nearby quakes it scores 0.93. For quakes more than 150 kilometers away, where the signal is faint, it drops to 0.38 (measured on only 54 far-away quakes, so that figure is rough).
It also raises false alarms on quiet data: given a stretch with no earthquake at all, it still calls a P pick 69% of the time and an S pick 51% of the time. In a real deployment you'd put a separate earthquake-detection step in front of it to filter those out. Our score measures picking the arrivals inside a clip that already contains a quake, not finding the quake from scratch.
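A detection step in front of the picker could look something like the sketch below. The text does not say which detector to use, so this swaps in the textbook STA/LTA energy trigger purely as a stand-in. The window lengths, the threshold of 4, the synthetic traces and the `fake_picker` placeholder are all assumptions.

```python
import numpy as np


def sta_lta_ratio(trace, sta_len, lta_len):
    """Short-term over long-term average energy, for windows ending at the same sample."""
    energy = np.asarray(trace, dtype=float) ** 2
    csum = np.concatenate(([0.0], np.cumsum(energy)))
    ends = np.arange(lta_len, len(energy) + 1)
    sta = (csum[ends] - csum[ends - sta_len]) / sta_len
    lta = (csum[ends] - csum[ends - lta_len]) / lta_len
    return sta / (lta + 1e-12)


def contains_event(trace, sta_len=50, lta_len=1000, threshold=4.0):
    """Stand-in detector: does the energy ratio ever jump above the threshold?"""
    return bool(np.max(sta_lta_ratio(trace, sta_len, lta_len)) > threshold)


def gated_pick(trace, picker):
    """Run the picker only on clips the detector flags; otherwise return None."""
    if not contains_event(trace):
        return None
    return picker(trace)


# Synthetic traces: pure noise, and the same noise with a louder 3-second burst.
rng = np.random.default_rng(0)
quiet = rng.normal(0.0, 1.0, 6000)
quake = quiet.copy()
quake[3000:3300] += rng.normal(0.0, 10.0, 300)


def fake_picker(trace):
    """Hypothetical placeholder for the real picking model."""
    return {"P": 30.0, "S": 33.0}


print(gated_pick(quiet, fake_picker), gated_pick(quake, fake_picker))
```

The picker never sees the quiet trace, so it has no chance to raise a false alarm there. A real deployment would tune or replace the detector; the point is only where the gate sits in the pipeline.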
Takeaways
- Bigger didn't win. A 34,000-parameter model beat a standard picker 8× its size on a fair, leakage-free test. For this job, accuracy wasn't about scale.
- One change carried the result. Rewarding the model for the rare P and S moments, not the common quiet, is what made the small model work. Without it, it never learned the P wave at all.
- Know the limits. The model degrades on far-away quakes and raises false alarms on quiet data, so it belongs behind a separate earthquake detector, not on its own.
The reflex in modern AI is to reach for a bigger model. But here the win came from teaching a small one to care about the rare moments that matter: a fix in how it learns, not how large it is. When the problem is well understood, careful supervision can beat raw scale, and the payoff is a model light enough to run almost anywhere.
The code and the trained model are available on request.