This article was originally published on runaihome.com
TL;DR: Qwen 3.6 35B-A3B is a Mixture-of-Experts model that costs only 3B parameters of compute per token — but all 35B must live in VRAM simultaneously, setting a hard 24GB floor. On a 24GB RTX 4090 it reaches 120 tok/s at Q4_K_M; a used RTX 3090 gets 107 tok/s with Ollama, or 135.7 tok/s with a properly tuned llama.cpp config. Anything with 16GB VRAM needs CPU offloading that cuts speed by half.
| RTX 4090 24GB | RTX 3090 24GB | Mac Mini M4 Pro 24GB | |
|---|---|---|---|
| Best for | Fastest local option | Budget 24GB route | Quiet, no discrete GPU |
| Q4_K_M (21 GB) | ✅ 120 tok/s | ✅ 107 tok/s | ✅ 15–22 tok/s |
| Q5_K_M (25 GB) | ❌ Overflow | ❌ Overflow | ❌ Overflow |
| Street price (Jun 2026) | $1,500–$1,700 | $480–$550 used | $1,399 |
Honest take: If you own a 24GB GPU, Qwen 3.6 35B-A3B is worth trying for coding. If you have 16GB, the Qwen 3.6 27B dense sibling fits natively, scores higher on SWE-bench (77.2% vs 73.4%), and runs faster on the same hardware — pick that one instead.
Alibaba released Qwen3.6-35B-A3B on April 16, 2026 under Apache 2.0. The headline number is 73.4% on SWE-bench Verified (Alibaba's internal agent scaffold), which puts it in the same neighborhood as frontier closed models. But the 24GB VRAM requirement is non-negotiable for usable performance, and understanding why that's true requires understanding how MoE memory actually works.
The MoE memory trap
Most people hear "only 3B active parameters" and assume the model needs 3B worth of VRAM. It doesn't.
Mixture-of-Experts routing is cheap at compute time: on each forward pass, the router selects a subset of expert layers — the equivalent of ~3B parameters — to process each token. Everything else sits idle. But idle does not mean absent. All 35B parameters must be loaded into memory before the router can decide which ones to activate.
The practical result: Qwen 3.6 35B-A3B has the inference speed of a 3B model but the VRAM footprint of a 35B model. That's the MoE tradeoff. You get cheap compute in exchange for a fixed memory tax you cannot escape.
This is also why comparing to dense models requires care. The Qwen 3.6 27B dense activates all 27B parameters on every token, which sounds heavier — but at Q4_K_M it needs only ~16 GB, leaving 8 GB of headroom on a 24GB card. And it scores higher on SWE-bench. The 35B-A3B model wins on throughput efficiency and very long-context workloads; it doesn't automatically win on coding quality.
VRAM requirements by quantization
Weight-only VRAM, before KV cache:
| Quantization | VRAM (weights only) | Headroom on 24GB card |
|---|---|---|
| Q3_K_M | ~15.8 GB | 8.2 GB (fits, quality loss) |
| Q4_0 | ~19.6 GB | 4.4 GB |
| Q4_K_M | ~21 GB | 3 GB |
| UD-Q4_K_XL (Unsloth) | ~22.1 GB | 1.9 GB |
| Q5_K_M | ~25.2 GB | ❌ Overflow |
| Q6_K | ~28.7 GB | ❌ Overflow |
| Q8_0 | ~37 GB | Needs 40GB+ |
| FP16 | ~71.8 GB | Server GPUs only |
The Unsloth UD-Q4_K_XL is the community-preferred quant for 24GB GPUs: it uses Unsloth's Dynamic quantization to preserve perplexity closer to Q5 while staying at 22.1 GB. The 1.9 GB of free VRAM after loading is tight for KV cache, which is why KV cache quantization in llama.cpp matters — it compresses the cache to reclaim headroom.
KV cache adds on top of weights:
- 8K context, Q8_0 KV cache: ~1.5 GB extra
- 16K context, Q8_0 KV cache: ~3 GB extra
- 32K context, Q8_0 KV cache: ~6 GB extra
- 262K context (full native): ~48 GB extra — no single consumer GPU
For interactive coding sessions, 16K context covers roughly 12,000 lines of code. Q4_K_M weights plus Q8_0 KV cache keeps you within 24GB. That handles the overwhelming majority of coding tasks. The 262K and 1M context claims are real but require a Mac with 96GB+ unified memory or multi-GPU setups.
GPU compatibility
24GB cards — where this model lives
RTX 4090 ($1,500–$1,700 new)
Fastest consumer option for this model. At Q4_K_M via Ollama: ~78 tok/s with time-to-first-token of ~3.2 seconds at 8K context. Switch to llama.cpp with UD-Q4_K_XL and flash attention enabled: 120+ tok/s. The 4090's 1,008 GB/s bandwidth gives it a slight edge over the 3090 on sustained generation, though the gap is smaller than the bandwidth difference suggests — MoE routing is more memory-latency than bandwidth sensitive.
RTX 3090 ($480–$550 used, June 2026)
Same 24GB as the 4090, lower bandwidth (936 GB/s). Ollama gives ~107 tok/s. With llama.cpp's UD-Q4_K_XL and --cache-type-k q8_0 --cache-type-v q8_0: 135.7 tok/s — 27% faster than Ollama on the same hardware. The RTX 3090 remains the best-value 24GB card for local AI. Full value analysis here.
Mac Mini M4 Pro 24GB ($1,399)
Unified memory means the GPU and CPU share the same 400+ GB/s bandwidth pool — architecturally clean for MoE routing since the router can page expert parameters without crossing a PCIe bus. The downside is absolute throughput: 15–22 tok/s at Q4_0, 8K–16K context. Enough for interactive coding; not for batch work. At 48GB ($1,599), you gain Q5_K_M support and can push context to 32K comfortably.
16GB cards — the wall
Every 16GB consumer GPU (RTX 5060 Ti, RTX 4060 Ti, RTX 5070, RTX 5080, RX 9070 XT) runs into the same issue: Q4_K_M needs 21 GB, and 16 < 21. Three paths forward:
Option 1: Q3_K_M (~15.8 GB) — fits natively on 16GB VRAM, but coding quality drops noticeably. On instruction-following and code completion tasks, Q3 typically loses 5–8 points versus Q4 on this architecture. Whether that's acceptable depends on your workload.
Option 2: CPU layer offloading — with 64 GB+ system DDR5 and -ngl 20 (partial GPU offload in llama.cpp), expect 30–50 tok/s. Usable for background inference tasks, poor for interactive coding where latency matters.
Option 3: Use Qwen 3.6 27B dense — fits in ~16 GB at Q4_K_M, runs at ~140 tok/s on an RTX 4090 (faster than the 35B-A3B), and scores 77.2% SWE-bench. For pure coding on 16GB hardware, this is the right model. See the full 27B guide.
More VRAM (comfortable range)
RTX 5090 32GB — fits Q5_K_M with 7 GB of KV cache headroom. Community reports put it at 160–180 tok/s with llama.cpp. The step from Q4 to Q5 is essentially free since you're using VRAM that would otherwise be idle.
Mac Studio M4 Max 36GB+ — comfortable at Q5_K_M, reaching ~30–40 tok/s. The bandwidth jump (546 GB/s vs 400 GB/s on Mac Mini M4 Pro) translates to roughly 2× generation speed on this model. Full comparison in the Mac Studio vs Mac Mini guide.
Benchmark: Ollama vs llama.cpp on RTX 3090
The RTX 3090 data makes the backend gap concrete:
| Backend | Config | Speed (RTX 3090) |
|---|---|---|
| Ollama 0.20.7 | Default Q4 | 107 tok/s |
| llama.cpp | UD-Q4_K_XL + --cache-type-k q8_0 --cache-type-v q8_0
|
135.7 tok/s (+27%) |
The difference is KV cache quantization. Ollama doesn't expose that knob; llama.cpp does. At 135.7 tok/s the model delivers tokens faster than most people read them — it feels instantaneous for single-user coding sessions.
For context on how this compares to other backends across different workloads, see the vLLM vs Ollama comparison. vLLM pulls ahead at multi-user concurrency but doesn't add value for a single-user home setup.
Setup
Ollama (quickest start)
ollama pull qwen3.6:35b-a3b
ollama run qwen3.6:35b-a3b
Downloads the default Q4_K_M GGUF (~21 GB). Ollama sets a 256K context window in its model card, but the effective limit is your available VRAM mi
Top comments (0)