Research

Research from the Transformer Lab team.

18 articles

The best-sounding open-source music generator is also the only one you can steer

Salomone, Gandhi, Asaria · June 25, 2026 · AUDIO, EVALUATION

Vendor reports for text-to-music models rank systems with Frechet Audio Distance (FAD) and CLAP score computed on a single CLAP encoder, the same family a model may be trained against, raising a circularity concern. We ask two questions about three open-source instrumental genera…

How small can an earthquake detector be? A model 8 times smaller that still wins

Asaria, Salomone, Gandhi · June 25, 2026 · EFFICIENCY, SEISMOLOGY

Deep neural pickers such as PhaseNet and EQTransformer are the default tools for detecting P- and S-wave arrivals, but they carry hundreds of thousands of parameters and are usually benchmarked on randomly split data, where near-duplicate windows of the same event leak between tr…

Distilled reasoning models trust the reasoning you give them more than the reasoning they write

Asaria, Salomone, Gandhi · June 25, 2026 · LLM, INTERPRETABILITY

Reasoning models write a chain-of-thought (CoT) before answering, and that trace is increasingly read as an explanation. We ask whether the answer's probability actually depends, under intervention, on the reasoning steps, and whether this differs when the reasoning is provided t…

Scaling a protein model is what lets RL beat trial-and-error protein design

Asaria, Salomone, Gandhi · June 24, 2026 · RL, LLM

A deliberately fair, matched-query-budget comparison of GRPO reinforcement-learning fine-tuning of a protein language model (ESM-2) against classical directed evolution on the GB1 four-site fitness landscape, using an exact-lookup oracle (nothing to game), a strong simulated-anne…

What's the best way to run a 30B coding model on a 24 GB Mac?

Salomone, Asaria, Gandhi · June 23, 2026 · SYSTEMS, LLM

Asks whether a 30B-total / 3B-active MoE coding model (North-Mini-Code-1.0, 128 experts, top-8, Apache-2.0) can be quantized to beat its vendor 4-bit while still fitting a hard ≤24 GB Apple Silicon budget — and finds round-to-nearest (RTN) 4-bit is the ceiling, a careful negative…

Does fine-tuning a chatbot make it more brain-like? We checked

Asaria, Salomone, Gandhi · June 23, 2026 · LLM, INTERPRETABILITY

Reports that instruction-tuned models are more "brain-aligned" than their base versions are mostly a chat-template artifact, not a property of alignment training. Under identical raw text, post-training weight changes leave fMRI encoding alignment essentially unchanged (Qwen base…

When an AI searches for the equation behind your data, one dial steers it

Asaria, Salomone, Gandhi · June 23, 2026 · RL, LLM

A mechanistic study of how an RL objective shapes symbolic-regression search: the parsimony coefficient λ cleanly and monotonically sets the operating point on the accuracy–parsimony frontier, while the DAPO clip-higher asymmetry does not — it is the entropy regularizer, not the…

Training an AI to be correct collapses its variety

Asaria, Salomone, Gandhi · June 22, 2026 · LLM, RL

Verifiable-reward RL (RLVR) for procedural Sokoban-level generation triggers a sharp reward↔diversity phase transition: the model mode-collapses to one or two level templates (distinct-valid fraction ≈1.0 → <0.05) across three trust-region objectives (PPO, DAPO, DPPO), seeds, and…

How modular is a frontier Mixture-of-Experts?

Salomone, Gandhi, Asaria · June 20, 2026 · LLM, INTERPRETABILITY

A pre-registered causal test of whether the experts in a frontier MoE (Command A+, 218B total / 25B active, 128 experts) form functional modules tied to capabilities or languages. Of six pre-registered expert families ablated at inference time against a size-matched random-expert…

Teaching a language model to say "I'm not sure" using its own doubt

Asaria, Salomone, Gandhi · June 19, 2026 · LLM

A model's own token-probability "doubt" signal, used to decide when to abstain, matches label-supervised abstention-tuning without using any correctness labels. Across six open-weights models (1B–8B, two families) on short-form QA, the label-free LoRA recipe shows no statisticall…

Technical blog post

How I built my own Tesla-style self-driving AI

Ali Asaria · June 18, 2026 · VISION

A DINO-pretrained ViT-S is competitive with a pretrained ResNet-50 at end-to-end steering-angle prediction on a small slice (~5–16k frames) of the comma2k19 driving dataset — turn-slice Pearson 0.964 vs 0.967. Competitiveness is conditional on adaptation (low-LR full fine-tuning,…

Judging to improve: a 3D judge you can train against, and the limit of cheap 3D specialization

Asaria, Salomone, Gandhi · June 18, 2026 · 3D, VISION, LLM

A trainable, de-biased VLM-as-judge for single-image 3D generation — one VLM family labels training pairs, a different family scores, and verdicts only count when they survive an order swap. Used to test cheap label-free adaptation of a strong base: six methods reach only parity…

Train, retrieve, or both? What it takes to make a language model cite the law correctly

Asaria, Salomone, Gandhi · June 18, 2026 · LLM, RAG

A four-arm head-to-head (base, LoRA SFT, RAG, SFT+RAG) for correct statutory citation on Ontario tenancy law. The base model hallucinates 81% of its citations; retrieval is the decisive lever, driving hallucinations to zero by construction and lifting citation exact-match to 0.44…

Judging single-image 3D generation without humans (and why cheap proxies fall short)

Asaria, Salomone, Gandhi · June 16, 2026 · 3D, VISION

A standardized evaluation protocol for single-image-to-3D mesh generators, using 24-view rendering and position-bias correction — and showing that common proxies like CLIP similarity and geometry-validity metrics don't substitute for a VLM judge.

A Universal Post-Training Improvement for Open Audio Models

Asaria, Salomone, Gandhi · June 16, 2026 · AUDIO, LLM

ASR-based self-verification drives catastrophic failures (silence, early termination, repetition) to near zero in autoregressive neural-codec TTS, then distills the behavior for inference-time efficiency — generalizing across four TTS systems and three codecs.

Neither parallel nor sequential: how DiffusionGemma actually commits tokens

Asaria, Salomone, Gandhi · June 12, 2026 · LLM

A close look at token-commitment patterns in DiffusionGemma 26B. Contrary to parallel-decoding marketing, the behavior is neither parallel nor block-autoregressive: it shows a weak left-to-right bias and substantial within-batch ordering ambiguity.

Making INT8 actually fast: a fused kernel for Ideogram 4 on a 3090

Asaria, Salomone, Gandhi · June 12, 2026 · SYSTEMS, VISION

A fused Triton kernel that properly drives the INT8 tensor cores on consumer Ampere GPUs — ~1.1× end-to-end speedup, making 1024px generation feasible on a single RTX 3090.

Quantizing Ideogram 4.0 onto a 3090: an INT8 build that matches FP8 and a 4-bit GGUF that beats NF4

Gandhi, Asaria, Salomone · June 10, 2026 · VISION, SYSTEMS

Post-training quantization of Ideogram 4.0 where INT8 W8A8 comes out statistically indistinguishable from FP8 on key quality metrics, with INT8 and GGUF Q4_K both cutting compute for consumer-GPU deployment.