Research
Research from the Transformer Lab team.
18 articles
The best-sounding open-source music generator is also the only one you can steer
Vendor reports for text-to-music models rank systems with Fréchet Audio Distance (FAD) and CLAP score computed on a single CLAP encoder, the same family a model may be trained against, raising a circularity concern. We ask two questions about three open-source instrumental genera…
How small can an earthquake detector be? A model 8 times smaller that still wins
Deep neural pickers such as PhaseNet and EQTransformer are the default tools for detecting P- and S-wave arrivals, but they carry hundreds of thousands of parameters and are usually benchmarked on randomly split data, where near-duplicate windows of the same event leak between tr…
Distilled reasoning models trust the reasoning you give them more than the reasoning they write
Reasoning models write a chain-of-thought (CoT) before answering, and that trace is increasingly read as an explanation. We ask whether the answer's probability actually depends, under intervention, on the reasoning steps, and whether this differs when the reasoning is provided t…
Scaling a protein model is what lets RL beat trial-and-error protein design
A deliberately fair, matched-query-budget comparison of GRPO reinforcement-learning fine-tuning of a protein language model (ESM-2) against classical directed evolution on the GB1 four-site fitness landscape, using an exact-lookup oracle (nothing to game), a strong simulated-anne…
What's the best way to run a 30B coding model on a 24 GB Mac?
Asks whether a 30B-total / 3B-active MoE coding model (North-Mini-Code-1.0, 128 experts, top-8, Apache-2.0) can be quantized to beat its vendor 4-bit while still fitting a hard ≤24 GB Apple Silicon budget — and finds round-to-nearest (RTN) 4-bit is the ceiling, a careful negative…
Does fine-tuning a chatbot make it more brain-like? We checked
Claims that instruction-tuned models are more "brain-aligned" than their base versions mostly reflect a chat-template artifact, not a property of alignment training. Under identical raw text, post-training weight changes leave fMRI encoding alignment essentially unchanged (Qwen base…
When an AI searches for the equation behind your data, one dial steers it
A mechanistic study of how an RL objective shapes symbolic-regression search: the parsimony coefficient λ cleanly and monotonically sets the operating point on the accuracy–parsimony frontier, while the DAPO clip-higher asymmetry does not — it is the entropy regularizer, not the…
Training an AI to be correct collapses its variety
Verifiable-reward RL (RLVR) for procedural Sokoban-level generation triggers a sharp reward↔diversity phase transition: the model mode-collapses to one or two level templates (distinct-valid fraction ≈1.0 → <0.05) across three trust-region objectives (PPO, DAPO, DPPO), seeds, and…
How modular is a frontier Mixture-of-Experts?
A pre-registered causal test of whether the experts in a frontier MoE (Command A+, 218B total / 25B active, 128 experts) form functional modules tied to capabilities or languages. Of six pre-registered expert families ablated at inference time against a size-matched random-expert…
Teaching a language model to say "I'm not sure" using its own doubt
A model's own token-probability "doubt" signal, used to decide when to abstain, matches label-supervised abstention-tuning without using any correctness labels. Across six open-weights models (1B–8B, two families) on short-form QA, the label-free LoRA recipe shows no statisticall…
How I built my own Tesla-style self-driving AI
A DINO-pretrained ViT-S is competitive with a pretrained ResNet-50 at end-to-end steering-angle prediction on a small slice (~5–16k frames) of the comma2k19 driving dataset — turn-slice Pearson 0.964 vs 0.967. Competitiveness is conditional on adaptation (low-LR full fine-tuning,…
Judging to improve: a 3D judge you can train against, and the limit of cheap 3D specialization
A trainable, de-biased VLM-as-judge for single-image 3D generation — one VLM family labels training pairs, a different family scores, and verdicts only count when they survive an order swap. Used to test cheap label-free adaptation of a strong base: six methods reach only parity…
Train, retrieve, or both? What it takes to make a language model cite the law correctly
A four-arm head-to-head (base, LoRA SFT, RAG, SFT+RAG) for correct statutory citation on Ontario tenancy law. The base model hallucinates 81% of its citations; retrieval is the decisive lever, driving hallucinations to zero by construction and lifting citation exact-match to 0.44…
Judging single-image 3D generation without humans (and why cheap proxies fall short)
A standardized evaluation protocol for single-image-to-3D mesh generators, using 24-view rendering and position-bias correction. It shows that common proxies like CLIP similarity and geometry-validity metrics don't substitute for a VLM judge.
A Universal Post-Training Improvement for Open Audio Models
ASR-based self-verification drives catastrophic failures (silence, early termination, repetition) to near zero in autoregressive neural-codec TTS, then distills the behavior for inference-time efficiency — generalizing across four TTS systems and three codecs.
Neither parallel nor sequential: how DiffusionGemma actually commits tokens
A close look at token-commitment patterns in DiffusionGemma 26B. Contrary to parallel-decoding marketing, the behavior is neither parallel nor block-autoregressive: it shows a weak left-to-right bias and substantial within-batch ordering ambiguity.
Making INT8 actually fast: a fused kernel for Ideogram 4 on a 3090
A fused Triton kernel that properly drives the INT8 tensor cores on consumer Ampere GPUs — ~1.1× end-to-end speedup, making 1024px generation feasible on a single RTX 3090.
Quantizing Ideogram 4.0 onto a 3090: an INT8 build that matches FP8 and a 4-bit GGUF that beats NF4
Post-training quantization of Ideogram 4.0 where INT8 W8A8 comes out statistically indistinguishable from FP8 on key quality metrics, with INT8 and GGUF Q4_K both cutting compute for consumer-GPU deployment.