
Is 4-bit the Ceiling? Fitting North-Mini-Code into 24 GB on Apple Silicon, and a Streaming GPTQ for Large MoEs

Asaria, Salomone, Gandhi · June 23, 2026 · SYSTEMS · LLM

Asks whether a 30B-total / 3B-active MoE coding model (North-Mini-Code-1.0, 128 experts, top-8 routing) can be quantized to beat the vendor's 4-bit quantization while still fitting a hard ≤24 GB Apple Silicon memory budget, and finds that 4-bit is the ceiling. Across a structured search (uniform bit-width, group size, number format, built-in mixed-bit, custom per-expert allocation, and calibrated streaming GPTQ), no in-budget route beats round-to-nearest 4-bit on the coding eval. The closest, mixed_4_8 at 5.2 average bits / 20.5 GB, only ties (148 vs 146 of 164, McNemar exact p=0.79). For agentic use, the practical limiter turns out to be generation-length instability (repetition loops), not bit-width: 21 of 30 SWE-Bench Verified instances looped until they hit the generation cap. The portable contribution is a memory-efficient streaming GPTQ that quantizes a 60 GB, 128-expert MoE on a 48 GB Mac with a 3.75 GB peak memory footprint.
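The round-to-nearest (RTN) baseline that the search fails to beat can be illustrated as group-wise affine quantization: each group of consecutive weights gets its own scale and offset, and every weight is rounded to the nearest of 2^bits levels. This is a minimal pure-Python sketch, not the paper's or any library's implementation; the min/max grid, the group size of 64, and 16-bit scales and biases are assumptions. The hypothetical `effective_bits` helper shows why "average bits" can sit above the nominal bit-width: per-group metadata adds 2 × 16 / group_size bits per weight. Whether the paper's 5.2-bit figure counts this overhead is not stated here.

```python
from typing import List, Tuple


def quantize_rtn(weights: List[float], bits: int = 4, group_size: int = 64) -> Tuple[List[int], List[Tuple[float, float]]]:
    """Round-to-nearest affine quantization with one (scale, bias) pair per group (assumed scheme)."""
    levels = (1 << bits) - 1  # 15 for 4-bit, i.e. integer codes 0..15
    codes: List[int] = []
    params: List[Tuple[float, float]] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels if hi > lo else 1.0  # flat groups: any scale works, avoid dividing by 0
        params.append((scale, lo))
        for w in group:
            q = round((w - lo) / scale)  # nearest grid point
            codes.append(min(max(q, 0), levels))  # clamp against float drift at the edges
    return codes, params


def dequantize(codes: List[int], params: List[Tuple[float, float]], group_size: int = 64) -> List[float]:
    """Map integer codes back to floats using their group's scale and bias."""
    return [q * params[i // group_size][0] + params[i // group_size][1] for i, q in enumerate(codes)]


def effective_bits(bits: int, group_size: int, param_bits: int = 16) -> float:
    """Storage bits per weight including one scale and one bias of `param_bits` each per group."""
    return bits + 2 * param_bits / group_size
```

Under these assumptions, `effective_bits(4, 64)` is 4.5 and `effective_bits(4, 32)` is 5.0, so shrinking the group size is not free in a fixed memory budget.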
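The tie is judged with McNemar's exact test, which compares two models on the same 164 problems using only the discordant pairs: problems one model solves and the other does not. Under the null hypothesis each discordant problem is equally likely to favor either model, so the smaller discordant count follows Binomial(b + c, 0.5). A minimal sketch of the two-sided exact p-value follows; the function name is hypothetical, and the discordant counts behind the reported p=0.79 are not given in this summary.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: problems solved by model A but not by model B
    c: problems solved by model B but not by model A
    Concordant problems (both pass or both fail) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n  # P(X <= min(b, c))
    return min(1.0, 2 * tail)  # double the smaller tail, capped at 1
```

For example, `mcnemar_exact(0, 5)` returns 0.0625. A two-problem gap in pass totals fixes only the difference between b and c, not their size, which is why such a gap can be far from significant.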
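The summary reports runs that looped until the generation cap but does not say how loops were identified. One simple, assumed way to flag them is to check whether the tail of the generated token sequence is a single block repeated several times. In the hypothetical sketch below, `period_max` and `min_repeats` are arbitrary thresholds chosen for illustration, not values from the paper.

```python
from typing import Sequence


def has_repetition_loop(tokens: Sequence[int], period_max: int = 50, min_repeats: int = 4) -> bool:
    """Return True if the last p * min_repeats tokens are one length-p block repeated, for some p <= period_max."""
    n = len(tokens)
    for p in range(1, period_max + 1):
        span = p * min_repeats
        if span > n:
            break  # longer periods need even more tokens than we have
        tail = tokens[n - span:]
        # Periodic with period p: every position matches the same offset in the first block.
        if all(tail[i] == tail[i % p] for i in range(span)):
            return True
    return False
```

Running such a check every few decoding steps would let an agent harness abort a looping run early instead of spending the rest of its token budget; how well that trades off against false positives on legitimately repetitive code is untested here.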
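The summary gives only the streaming GPTQ's memory figures, not its internals. The general shape of a streaming, per-expert GPTQ pass can still be sketched under stated assumptions: each expert's calibration activations (assumed to be the tokens routed to it) are folded into a Hessian H = (2/n) Σ x xᵀ one vector at a time, the expert's weight columns are quantized in order with each rounding error pushed onto not-yet-quantized columns, and experts are processed one after another so only one expert's weights and Hessian are live. This toy uses the explicit inverse-Hessian downdate from Optimal Brain Quantization, which GPTQ reorganizes with a Cholesky factorization and blocked updates for speed. All names are hypothetical, and a real pipeline would load tensors lazily from disk rather than from Python lists.

```python
from typing import Iterable, Iterator, List, Tuple

Matrix = List[List[float]]


def accumulate_hessian(samples: Iterable[List[float]], d: int, damp: float = 0.01) -> Matrix:
    """Build H = (2/n) * sum(x x^T) from a stream of activations, then damp the diagonal."""
    h = [[0.0] * d for _ in range(d)]
    n = 0
    for x in samples:  # activations are consumed one at a time, never stored together
        n += 1
        for i in range(d):
            for j in range(d):
                h[i][j] += x[i] * x[j]
    h = [[2.0 / max(n, 1) * v for v in row] for row in h]
    mean_diag = sum(h[i][i] for i in range(d)) / d
    bump = damp * mean_diag if mean_diag > 0 else 1.0  # keeps H invertible even if rank-deficient
    for i in range(d):
        h[i][i] += bump
    return h


def invert(a: Matrix) -> Matrix:
    """Gauss-Jordan inverse with partial pivoting (adequate for small, damped H)."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0.0:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


def rtn_matrix(w: Matrix, bits: int = 4) -> Matrix:
    """Plain round-to-nearest on a symmetric per-row grid, for comparison."""
    qmax = (1 << (bits - 1)) - 1
    out = []
    for row in w:
        s = max(abs(v) for v in row) / qmax or 1.0
        out.append([min(max(round(v / s), -qmax - 1), qmax) * s for v in row])
    return out


def gptq_matrix(w: Matrix, h: Matrix, bits: int = 4) -> Matrix:
    """Quantize columns of w (rows = outputs, columns = inputs) in order with error feedback."""
    qmax = (1 << (bits - 1)) - 1  # symmetric signed grid, e.g. -8..7 for 4-bit
    w = [row[:] for row in w]  # working copy that absorbs the error feedback
    scales = [max(abs(v) for v in row) / qmax or 1.0 for row in w]  # same grid as rtn_matrix
    hinv = invert(h)
    d = len(h)
    q = [[0.0] * d for _ in w]
    for i in range(d):
        for r, row in enumerate(w):
            s = scales[r]
            q[r][i] = min(max(round(row[i] / s), -qmax - 1), qmax) * s
            err = (row[i] - q[r][i]) / hinv[i][i]
            for j in range(i + 1, d):
                row[j] -= err * hinv[i][j]  # compensate on columns not yet quantized
        piv = hinv[i][i]
        for j in range(i + 1, d):  # remove column i from H^-1 for the remaining columns
            for k in range(i + 1, d):
                hinv[j][k] -= hinv[j][i] * hinv[i][k] / piv
    return q


def stream_quantize(experts: Iterable[Tuple[str, Matrix, Iterable[List[float]]]], bits: int = 4) -> Iterator[Tuple[str, Matrix]]:
    """Yield (name, quantized weights) one expert at a time; earlier experts are not retained here."""
    for name, w, acts in experts:
        yield name, gptq_matrix(w, accumulate_hessian(acts, len(w[0])), bits)
```

When the calibration inputs are uncorrelated (H diagonal), the error feedback vanishes and `gptq_matrix` reduces to `rtn_matrix`; the two differ only when input dimensions are correlated. The peak-memory benefit depends on the caller writing each yielded expert to disk and dropping it before requesting the next.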
