A Universal Post-Training Improvement for Open Audio Models
A robustness study of open autoregressive codec-TTS: a tiny test-time trick all but removes catastrophic failures across four models and three codecs. It can also be distilled back into the model, so the cost is paid once in training rather than on every generation, although the gain shows up only where the model was actually failing.
Quick summary: Open-source text-to-speech models perform well most of the time but occasionally fail badly: going silent, looping a syllable, or saying something that isn't the text. We found a simple fix for those failures: generate a few takes, let an ASR model pick a clean one, and the failures all but disappear. The method is easy to apply and it worked on every model and codec we tried (four models, three codecs). We can also bake the behaviour back into the model so that a single normal generation inherits most of the benefit, but only where the model was actually failing. We share some samples of our results below, and we also report the things that didn't work, because they're the useful part.
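
To make the "generate a few takes, let an ASR model pick a clean one" idea concrete, here is a minimal sketch of what such a selection loop could look like. It is an illustration under assumptions, not the exact pipeline used in this study: `synthesize` and `transcribe` are hypothetical callables standing in for whatever TTS model and ASR model you use, and scoring takes by word error rate against the input text is our assumed selection criterion.

```python
import re
from typing import Any, Callable, List, Tuple


def normalize(text: str) -> List[str]:
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def pick_clean_take(
    text: str,
    synthesize: Callable[[str, int], Any],  # hypothetical: (text, seed) -> audio
    transcribe: Callable[[Any], str],       # hypothetical: audio -> transcript
    n_takes: int = 4,                       # assumed default, not from the study
    good_enough: float = 0.0,               # stop early once a take scores this low
) -> Tuple[Any, float]:
    """Generate up to n_takes and return the take whose transcript best matches text."""
    best_audio, best_wer = None, float("inf")
    for seed in range(n_takes):
        audio = synthesize(text, seed)
        wer = word_error_rate(text, transcribe(audio))
        if wer < best_wer:
            best_audio, best_wer = audio, wer
        if best_wer <= good_enough:
            break  # a clean take was found, so skip the remaining generations
    return best_audio, best_wer
```

A silent take transcribes to nothing and a looping take transcribes to repeated words, so both score badly on word error rate and lose to a take that matches the text. The early exit means most requests pay for only one generation, and the extra takes are spent only when the first one fails.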


