
What's the best way to run a 30B coding model on a 24 GB Mac?

7 min read

We explored ways to beat the existing open coding model's stock quantizations. In the end, the bottleneck turned out to be something else.

Quick summary: Our goal was to find the best coding model that fits in 24 GB on an Apple Silicon Mac. We started from North-Mini-Code, a 30-billion-parameter open model (Apache-2.0), and tried roughly a dozen ways to quantize it past its stock 4-bit version: cheaper formats, more expensive ones, custom per-part bit budgets, and a calibrated method we had to build new tooling just to run. None of them beat plain round-to-nearest 4-bit on coding tests. The 4-bit model already fits in 20.81 GB at a realistic working context and decodes at about 40 tokens per second, so memory and speed were never the real limit. The real limit only showed up when we ran the model as an agent: it often falls into repetition loops and never finishes the task. That is a decoding problem, not a quantization problem.

Running a local coding agent

With the ongoing surge in token prices, uncertainty over model availability, and a growing need for data privacy, many developers are looking to run a local coding agent. A common version of this is a Mac mini with a 24 GB memory cap, used as a private, always-on box. That cap is the whole challenge. The model's weights, a running scratchpad of everything it has read and written so far (the KV cache), and some overhead all have to fit at once. It also has to generate fast enough to be usable.
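The budget above is simple addition, and a rough sketch makes it concrete. The KV-cache formula is the standard one (keys plus values, per layer, per KV head, per token). Every architecture number and the weight and overhead figures below are hypothetical placeholders for illustration, not North-Mini-Code's actual configuration or our measurements.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Bytes needed for keys and values across all layers at a given context length."""
    # Factor of 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def fits_in_budget(weight_gb, kv_gb, overhead_gb, cap_gb=24.0):
    """Return (total_gb, fits) for a simple additive memory budget."""
    total = weight_gb + kv_gb + overhead_gb
    return total, total <= cap_gb


# Hypothetical values, chosen only to show the arithmetic.
kv_gb = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128,
                       context_tokens=32_000) / 1e9
total_gb, ok = fits_in_budget(weight_gb=17.0, kv_gb=kv_gb, overhead_gb=1.0)
print(f"KV cache: {kv_gb:.2f} GB, total: {total_gb:.2f} GB, fits: {ok}")
```

The KV cache grows linearly with context length, so the same model can fit at one context window and overflow at a longer one.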

North-Mini-Code is a strong candidate for that slot. It is a mixture-of-experts model: it has 30 billion parameters in total, but only about 3 billion run on any given token. It already ships in about 21 quantized builds, and the obvious one to run in 24 GB is the vendor's 4-bit version. Our goal was simple to state: beat that 4-bit build on coding quality, under the same 24 GB cap, measured on our own test harness.

Can anything beat 4-bit?

The research literature is unusually clear on one point. For models like this, 4-bit is the practical operating point, and dropping below it makes coding ability fall apart. So we did not chase the 3-bit or 2-bit win that the field keeps failing to find. Instead, holding the 24 GB budget fixed, we asked whether any method beats 4-bit.

The baseline to beat is round-to-nearest, or RTN: the simplest possible quantization, where you just round each weight to the nearest low-precision value. No data, no cleverness. Fancier methods are usually expected to beat it.
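A minimal sketch of round-to-nearest on one small group of weights shows how little is going on. The per-group scale and offset used here are a common convention and an assumption on our part; the function names are hypothetical, and this is not the vendor's quantization code.

```python
def rtn_quantize(weights, bits=4):
    """Round-to-nearest quantization of one group of weights.

    Returns integer codes plus the scale and offset needed to dequantize.
    """
    levels = (1 << bits) - 1                 # 15 steps between 16 values for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # No calibration data: each weight simply snaps to the nearest grid point.
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def rtn_dequantize(codes, scale, lo):
    """Map integer codes back to approximate floating-point weights."""
    return [lo + c * scale for c in codes]


group = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, scale, lo = rtn_quantize(group)
restored = rtn_dequantize(codes, scale, lo)
print(codes)
print([round(r, 3) for r in restored])
```

Each restored weight lands within half a grid step of the original. Calibrated methods try to choose the rounding direction more carefully, which is why they are usually expected to win.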

What we threw at it

We ran a structured search across the obvious levers:

  • Uniform settings: different bit-widths and number formats.
  • Mixed precision: spend more bits on the parts that matter (the router and attention) and fewer on the rarely-used experts. We included a build that spends more memory than 4-bit, not just less, to give accuracy every chance to come back.
  • A custom per-expert budget tuned to this model's specific architecture.
  • Calibrated GPTQ: a stronger method that looks at example data to decide how to round each weight more carefully, instead of rounding blindly.
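To see how a mixed-precision budget trades bits between parts of the model, it helps to compute the parameter-weighted average. The sketch below does only that arithmetic. The parameter split and bit-widths are hypothetical, not the builds we tested, and the calculation ignores the per-group scale metadata that real formats add.

```python
def average_bits(parts):
    """Parameter-weighted average bits per weight.

    `parts` maps a component name to (parameter_count, bits).
    """
    total_params = sum(n for n, _ in parts.values())
    total_bits = sum(n * b for n, b in parts.values())
    return total_bits / total_params


def weight_gb(parts):
    """Weight storage in GB, ignoring quantization metadata."""
    return sum(n * b for n, b in parts.values()) / 8 / 1e9


# Hypothetical split of a 30-billion-parameter mixture-of-experts model.
mixed = {
    "experts":   (28.0e9, 3),   # rarely used, so fewer bits
    "attention": (1.9e9, 6),    # used on every token, so more bits
    "router":    (0.1e9, 8),
}
print(f"{average_bits(mixed):.2f} bits/weight, {weight_gb(mixed):.2f} GB")
```

Because the experts hold most of the parameters, their bit-width dominates the total; raising precision on the router and attention costs comparatively little memory.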

The standard calibrated-GPTQ implementation could not run on our 48 GB development Mac at all. It needs the full model resident (about 60 GB) plus a large statistical table per expert, which for 128 experts balloons into hundreds of gigabytes. So we built a streaming version that processes one layer at a time and frees each layer's scratch data before moving on. It quantizes the whole 128-expert model at a 3.75 GB peak, roughly 13 times less than the 48 GB the machine has. Without this independent streaming port we could not have tested the strongest method at all.
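The streaming idea can be sketched as a loop that holds only one layer in memory at a time. This is an illustration of the control flow only: all function names are hypothetical, the "layers" are toy lists, and the calibration statistics that GPTQ actually computes are omitted entirely. It is not our released code.

```python
def stream_quantize(load_layer, n_layers, quantize_layer, save_layer):
    """Quantize a model one layer at a time, keeping a single layer resident.

    Returns the largest number of values held in memory at once.
    """
    peak_values = 0
    for i in range(n_layers):
        layer = load_layer(i)                  # bring one layer into memory
        peak_values = max(peak_values, len(layer))
        save_layer(i, quantize_layer(layer))   # write the result out immediately
        del layer                              # release the layer before the next one
    return peak_values


# Toy stand-ins for real loading, quantizing, and saving.
def fake_load(i):
    return [float(i + j) for j in range(1000)]


def fake_quantize(layer):
    return [round(w) for w in layer]


saved = {}
peak = stream_quantize(fake_load, 8, fake_quantize, saved.__setitem__)
print(f"layers saved: {len(saved)}, peak resident values: {peak}")
```

Peak memory then scales with the size of one layer plus its scratch data rather than with the whole model, which is the property that matters on a memory-limited machine.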

Nothing beat 4-bit

Figure: Every build we tried, by coding score and memory. Axes: peak memory (GB, 12 to 24, with the 24 GB cap marked) and coding score (0.85 to 0.92). Legend: full HumanEval-164 (comparable); HE-50 screening subset.
RTN 4-bit. The baseline. The simplest quantization there is: round each weight to the nearest value, no data, no tuning. This is the one to beat.
The three builds scored on the full test (RTN 4-bit, the higher-bit mixed build, and calibrated streaming-GPTQ) land on top of each other. Nothing in the 24 GB budget beats plain round-to-nearest 4-bit.

Across every in-budget route, nothing beat plain RTN 4-bit. On the full test the closest builds tied it; on a quicker screening subset the lower-bit builds came in below it. The two closest contenders both tie:

  • The higher-bit mixed build scored 0.9024 against RTN's 0.8902 on the full HumanEval test (a standard set of coding problems). That is a 1.2-point gap, below our 1.5-point significance bar, and a paired statistical test says they are indistinguishable. It also costs about 1.2 GB more for no measurable gain.
  • The calibrated streaming-GPTQ build scored 0.8841, statistically identical to RTN.
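On 164 problems, a 1.2-point gap is two problems, which is why a paired test matters here. The sketch below uses an exact McNemar test, one common choice for paired pass/fail results; that choice is an assumption, not a statement of which test we ran. The discordant counts in the example are hypothetical, picked only so that they net to a two-problem difference.

```python
from math import comb


def mcnemar_exact(only_a_passes, only_b_passes):
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    Problems that both builds pass, or both fail, carry no information
    about which build is better and are ignored.
    """
    n = only_a_passes + only_b_passes
    if n == 0:
        return 1.0
    k = min(only_a_passes, only_b_passes)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical: build A alone passes 5 problems, build B alone passes 3.
p = mcnemar_exact(5, 3)
print(f"p = {p:.3f}")
```

With so few discordant problems, a net difference of two cannot be separated from noise, which is the sense in which two builds "tie".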

A cheap attempt to distill the model into a smaller dense one (training a smaller student to imitate the big model) failed its early checkpoint outright, scoring 0 out of 32. So the best model you can run in 24 GB is the stock 4-bit build. We never measured the original full-precision model, so the finding is that nothing we tried beats 4-bit, not that 4-bit gives up no quality at all.

The bottleneck is the model not stopping

The 4-bit model clears both hard constraints with room to spare: 20.81 GB at a 32,000-token context (a long working window), and about 40 tokens per second. Memory and speed were never binding. The trouble showed up when we ran it as a coding agent on real GitHub issues. It resolved just 6 of the 30 issues we tried, or 20%. In 21 of the 30 cases it fell into a repetition loop and ran until it hit the token limit, and 20 of those 21 handed back no patch at all. When it did finish, the patch was usually right.

Figure: The model loops instead of stopping when it runs as an agent (token counter running up to a 9,000-token cap).
21 of 30 runs ended like this, looping to the token cap. 20 of them produced no patch at all. When the model did stop, the patch was usually correct.
The repeated line is the model re-emitting the same code-edit marker instead of finishing. The 4-bit model has the memory and speed to spare. As an agent it often will not stop, which points the fix at the decoding and stopping rules rather than the quantization.

We ran a deliberately bare-bones harness (a single shot, no multi-step loop), so this 20% is a floor on the model's ability, not a fair score. The diagnosis is what matters: for agentic use, the fix to chase is better decoding and stopping rules. The bit budget is already fine.
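One simple form a stopping rule could take is a guard that watches the output for a short pattern repeating back to back and ends generation early. The sketch below is a hypothetical illustration, not the harness we ran and not a fix we have evaluated; the function names, thresholds, and the edit-marker string are all made up for the example.

```python
def detect_loop(lines, max_period=4, min_repeats=4):
    """Return the period of a trailing repetition loop, or None if there is none.

    A loop is a block of `period` lines repeated `min_repeats` times in a row
    at the end of the output.
    """
    for period in range(1, max_period + 1):
        need = period * min_repeats
        if len(lines) < need:
            continue
        tail = lines[-need:]
        pattern = tail[:period]
        if all(tail[i:i + period] == pattern for i in range(0, need, period)):
            return period
    return None


def generate_with_guard(line_stream, max_lines=1000):
    """Consume generated lines, stopping early on a loop or at a length cap."""
    out = []
    for line in line_stream:
        out.append(line)
        if detect_loop(out) is not None:
            return out, "stopped: repetition loop"
        if len(out) >= max_lines:
            return out, "stopped: length cap"
    return out, "finished"


# Hypothetical output: two useful lines, then the same marker over and over.
stream = iter(["def fix():", "    return 1"] + ["<<< EDIT"] * 100)
lines, status = generate_with_guard(stream)
print(len(lines), status)
```

A guard like this only stops the waste; it does not make the model finish the task. It does turn a silent run to the token cap into a detectable event that the agent loop can retry or recover from.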

Takeaways

  • For this model class, plain round-to-nearest 4-bit is already about as good as it gets for the memory it uses. Before trusting a calibrated or mixed-bit method, check that it actually beats RTN. Often it does not.
  • Spending more bits, and more memory, did not help either. Using the whole budget is not automatically a win.
  • For agentic coding, generation stability can matter more than the quantization choice. Measure the agent loop, not just whether it solves a problem on the first try.
  • Streaming, one-layer-at-a-time quantization makes calibrated quantization feasible for large mixture-of-experts models on hardware you already own.

What's next

The 4-bit build is the model to ship in 24 GB today. The open work is not more compression. It is the agent loop: better stopping rules and decoding so the model finishes the task instead of looping. We are also curious whether router calibration, the one lever we did not pull, can move a result that nothing else did.

Built on Cohere Labs' North-Mini-Code-1.0 (Apache-2.0). The streaming-GPTQ code and evaluation harness are open. Read the full paper.