Round-to-Nearest Is Hard to Beat: A 30B MoE Coding Model in 24 GB on Apple Silicon

Salomone, Asaria, Gandhi · June 23, 2026 · SYSTEMS · LLM

This paper asks whether a 30B-total / 3B-active MoE coding model (North-Mini-Code-1.0, 128 experts, top-8, Apache-2.0) can be quantized to beat the vendor's 4-bit quantization while still fitting a hard ≤24 GB Apple Silicon budget. The answer is a careful negative result: round-to-nearest (RTN) 4-bit is the ceiling. Across a structured search (uniform bit-width, group size, number format, built-in mixed-bit, a North-aware custom per-expert allocation, and calibrated streaming GPTQ), no in-budget route beats RTN 4-bit on the coding eval, including configurations that spend more memory than 4-bit. The two closest candidates are statistically tied with RTN on the full HumanEval-164: the in-budget higher-bit mixed_4_8 (5.2 avg bits, 20.5 GB) scores 0.9024 (148 vs. 146 of 164, McNemar exact p=0.79), and calibrated streaming GPTQ scores 0.8841 (p=1.000). A cheap MoE→dense distillation pilot fails its triage gate (0/32). The 4-bit model fits in 20.81 GB at a real 32K-token agentic turn and decodes at 40.2 tok/s, so both hard constraints are met while the primary success criterion is not. The portable contribution is a memory-efficient streaming GPTQ that quantizes a 60 GB / 128-expert MoE on a 48 GB Mac at a 3.75 GB peak, where stock mlx-lm GPTQ is infeasible. What limits agentic use is generation-length instability (21/30 SWE-Bench Verified instances loop to the token cap), not bit-width or memory.
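To make the "tie" claims above concrete, here is a minimal sketch of a two-sided exact McNemar test for paired pass/fail results on the same problems. The test depends only on the discordant counts (problems solved by one model but not the other), which the abstract does not report. The split used below (8 vs. 6) is a hypothetical chosen only because it matches the reported 2-problem gap (148 vs. 146), and the function name `mcnemar_exact` is likewise illustrative rather than taken from the authors' code.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: problems solved only by the first model.
    c: problems solved only by the second model.
    Under the null hypothesis each discordant pair is a fair coin flip,
    so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: nothing distinguishes the models
    k = min(b, c)
    # Probability of a split at least as lopsided as the one seen, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double for the two-sided test; cap at 1.0 (matters when b == c).
    return min(1.0, 2.0 * tail)


# ASSUMPTION: hypothetical discordant split, not reported in the source.
# 8 problems solved only by mixed_4_8, 6 solved only by RTN 4-bit (gap of 2).
print(f"{mcnemar_exact(8, 6):.2f}")  # prints 0.79
```

Under this assumed split the p-value comes out near the reported 0.79. A one-problem gap with an odd number of discordant pairs (for example 1 vs. 0, or 2 vs. 1) always yields p = 1.0, which is consistent with the value reported for streaming GPTQ at 145 vs. 146.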
