Jovan Chan

Posted on Jun 11 • Originally published at runaihome.com

Qwen 3.7-Max for Local AI in 2026: What VRAM You'll Need When the Open Weights Drop

#qwen #localllm #vram #moe

TL;DR: Qwen 3.7-Max launched May 19 as a closed-weight API model scoring 80.4% on SWE-Verified — effectively tied with Claude Opus 4.6. Open weights (expected as a 27B dense and a 35B-A3B MoE variant) are anticipated mid-to-late June 2026 based on Alibaba's 3–4 week release cadence. A 24 GB GPU handles both formats at Q4_K_M today using the Qwen 3.6 generation, and the 3.7 open weights will land in the same VRAM tier.

	RTX 3090 24GB (used)	RTX 4090 24GB	Mac Studio M4 Max 96GB
Best for	35B-A3B MoE or 27B dense at Q4	Same + ~2x faster tok/s	Q6_K quality, silent operation
Price	~$600–$800 used (Jun 2026)	~$1,600 new	$3,999
The catch	Tight VRAM at 128K+ context	Overkill for this tier	ARM ecosystem friction

Honest take: Buy a used RTX 3090 now, run Qwen 3.6-27B today, and swap to 3.7 open weights the day they land — identical hardware requirement, one ollama pull away.

What Qwen 3.7-Max Actually Is

Alibaba announced Qwen 3.7-Max on May 19, 2026. It's a proprietary Mixture-of-Experts model — Alibaba hasn't disclosed the parameter count, but third-party analysis from Artificial Analysis estimates it in the same range as the Qwen 3.6-Max-Preview, approximately 1 trillion or more total parameters. Context window: 1 million tokens. API pricing on OpenRouter: $1.25/M input tokens, $3.75/M output tokens.

"Max" in Alibaba's naming means closed-weights flagship. The last two Qwen generations followed the same pattern: the API-only Max drops first, and open-weight smaller variants come 3–4 weeks later. For home-lab purposes, that's the entire story right now: you cannot run Max locally. You're waiting on the open models.

What you can run today: Qwen 3.6-27B (dense) and Qwen 3.6-35B-A3B (MoE), both available on Hugging Face. These are the direct predecessors and share the hardware profile of whatever Qwen 3.7 open weights Alibaba releases.

Benchmark Context: Where 3.7-Max Sits

Qwen 3.7-Max scores 80.4% on SWE-Verified, essentially tied with Claude Opus 4.6 (80.8%) and DeepSeek V4 Pro Max (80.6%). On the harder SWE-Pro benchmark, it reaches 60.6%, ahead of Kimi K2.6 Thinking (59.5%) and DeepSeek V4 Pro Max (59.0%). On Terminal Bench 2.0-Terminus it scores 69.7%, against 67.9% for DeepSeek V4 Pro Max.

That's frontier-class coding performance. For home-lab planning, the relevant implication is that open-weight variants will trade some of that capability for runability. Based on the Qwen 3.6 precedent, open-weight models typically land 8–15 percentage points below the Max flagship on agentic benchmarks. Still competitive for offline coding, document work, and personal agents.

Worth noting: Qwen 3.7-Plus (the mid-tier API model) matches 3.7-Max on AIME mathematics benchmarks while running 3x faster at lower cost. The open weights will be more Plus-tier than Max-tier in real-world quality — useful context before building hardware expectations around the 80.4% SWE number.

The Open-Weight Timeline

Alibaba has been consistent enough that you can set a rough calendar:

Release	API to Open Weights	Gap
Qwen 3.5	API Jan 2026 → Open Feb 2026	~3 weeks
Qwen 3.6	API early Apr 2026 → Open Apr 16, 2026	~3.5 weeks
Qwen 3.7	API May 19, 2026 → Open est. mid-Jun 2026	~3–4 weeks

The QwenLM GitHub is the leading indicator — the community watches it for new repository pushes before any blog announcement. No open-weight Qwen 3.7 repository existed as of June 7, 2026, but the window is open.
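The calendar math behind that estimate is simple enough to sketch. This assumes the 3–4 week gap holds for Qwen 3.7, which is an extrapolation from past releases rather than anything Alibaba has confirmed; the function name is illustrative.

```python
from datetime import date, timedelta

def open_weight_window(api_release, min_weeks=3, max_weeks=4):
    """Return (earliest, latest) expected open-weight dates.

    The 3-4 week default is the cadence assumed in this article,
    not an official schedule.
    """
    return (api_release + timedelta(weeks=min_weeks),
            api_release + timedelta(weeks=max_weeks))

# Qwen 3.7-Max API launch date from the article.
earliest, latest = open_weight_window(date(2026, 5, 19))
print(earliest.isoformat(), "to", latest.isoformat())
```

Run against the May 19 launch, this puts the window at June 9 through June 16, 2026.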

Expected open-weight variants, extrapolated from the Qwen 3.6 release structure:

Qwen3.7-27B — dense model, same architecture class as Qwen3.6-27B
Qwen3.7-35B-A3B — MoE, 35B total parameters / 3B active parameters per token

A 72B dense variant is not expected in the first wave. Alibaba has consistently shipped the 27B dense and 35B MoE as the home-lab-viable tier before releasing larger sizes.

Hardware Reality: VRAM Requirements

Since the 3.7 open weights aren't available yet, Qwen 3.6 is the most accurate proxy. Both generations are expected to share the same architectural lineage (hybrid attention, MoE routing structure), and Alibaba has not announced architectural changes that would significantly shift the VRAM footprint between 3.6 and 3.7 at equivalent sizes.

These are measured VRAM numbers from community llama.cpp GGUF testing:

Qwen3.6-27B Dense — Proxy for Qwen3.7-27B

Quantization	VRAM Required	Min GPU
Q4_K_M	~16.8 GB	RTX 3090/4090 24GB
Q5_K_M	~19.5 GB	24 GB required
Q6_K	~22.5 GB	24 GB (tight on 3090)
Q8_0	~28.6 GB	Dual 16 GB or Mac unified

Qwen3.6-35B-A3B MoE — Proxy for Qwen3.7-35B-A3B

Quantization	VRAM Required	Min GPU
Q4_K_M	~21 GB	RTX 3090/4090 24GB
Q5_K_M	~24.5 GB	Dual-GPU or partial CPU offload (exceeds 24 GB)
Q8_0	~43 GB	Dual RTX 3090 or workstation card

For a breakdown of what quality degradation actually looks like at each quantization level, see the Q4 vs Q5 vs Q6 vs Q8 quality comparison.
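As a rough planning aid, the two VRAM tables can be turned into a lookup that picks the largest quantization fitting a given card. This is a minimal sketch: the figures are copied from the tables above, while the 1 GB headroom default and the function name are assumptions for illustration. Real headroom depends on context length and whatever else is using the GPU.

```python
# VRAM figures (GB) copied from the two community-measured tables above.
VRAM_GB = {
    "27B dense": {"Q4_K_M": 16.8, "Q5_K_M": 19.5, "Q6_K": 22.5, "Q8_0": 28.6},
    "35B-A3B MoE": {"Q4_K_M": 21.0, "Q5_K_M": 24.5, "Q8_0": 43.0},
}

def best_quant(model, gpu_vram_gb, headroom_gb=1.0):
    """Return the largest quant whose footprint plus headroom fits, else None.

    headroom_gb is an assumed safety margin, not a measured value.
    """
    budget = gpu_vram_gb - headroom_gb
    fitting = {q: gb for q, gb in VRAM_GB[model].items() if gb <= budget}
    return max(fitting, key=fitting.get) if fitting else None

for model in VRAM_GB:
    for vram in (8, 16, 24):
        print(f"{model:12s} on {vram:2d} GB: {best_quant(model, vram)}")
```

With these numbers, only the 24 GB tier returns a usable quantization for either model; 8 GB and 16 GB cards come back empty.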

The 8 GB Wall

At 8 GB VRAM (RTX 4060, RTX 4060 Ti 8GB, or the base RTX 5060 8GB), neither the 27B dense nor the 35B-A3B MoE fits at any practical quantization. Q3_K_M on the 27B requires ~14 GB, still well over budget; anything that fits in 8 GB has to be in the 9B-or-smaller size class. If Alibaba follows the Qwen 3.5 pattern and ships a full family (from 0.6B to 72B), there may be 7B or 9B variants, but those won't carry the reasoning improvements that make 3.7-Max interesting.

On why the 16GB vs 8GB split matters so much for this generation, see the RTX 5060 Ti 8GB vs 16GB comparison.

Real Token Speed Numbers (Qwen 3.6 Proxy)

These are community benchmark results using llama.cpp GGUFs on consumer hardware — the best predictor of what Qwen 3.7 open weights will deliver at the same VRAM tier:

GPU	Model	Quant	tok/s
RTX 3090 24GB	Qwen3.6-35B-A3B	Q4_K_M	55–65
RTX 3090 24GB	Qwen3.6-27B	Q4_K_M	~35 baseline / ~74 with DFlash
RTX 4090 24GB	Qwen3.6-35B-A3B	Q4_K_M	~122
Mac Studio M4 Max 96GB	Qwen3.6-27B	Q4_K_M	~16.6

The MoE architecture advantage matters here. Qwen3.6-35B-A3B activates only 3B parameters per forward pass despite having 35B total parameters, so inference speed looks closer to that of a much smaller dense model. On an RTX 3090 you get ~60 tok/s on the MoE vs ~35 tok/s on the 27B dense, for about 4 GB more VRAM at Q4_K_M (~21 GB vs ~16.8 GB).

The RTX 4090 gap is real: ~122 tok/s vs ~60 tok/s on the 3090 for the 35B-A3B MoE. Whether that 2x throughput is worth $800+ extra depends entirely on your use case. For interactive coding agents, 60 tok/s is already fast enough to stay out of your way. For batch jobs processing dozens of long documents, the 4090 starts earning its premium.
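To make those rates concrete, here is a small sketch that converts the table's tok/s figures into wall-clock time for a reply. The 2,000-token reply length is an arbitrary example, the 60 tok/s value is the midpoint of the 55–65 range, and prompt-processing time is ignored.

```python
# tok/s figures from the benchmark table above (midpoint used for the 55-65 range).
TOK_PER_S = {
    ("RTX 3090", "35B-A3B MoE"): 60.0,
    ("RTX 3090", "27B dense"): 35.0,
    ("RTX 4090", "35B-A3B MoE"): 122.0,
}

def seconds_for(tokens, tok_per_s):
    """Decode time only; prompt processing is not included."""
    return tokens / tok_per_s

for (gpu, model), rate in TOK_PER_S.items():
    print(f"{gpu} / {model}: {seconds_for(2000, rate):.0f} s for a 2,000-token reply")

# Speedup ratios implied by the same table.
moe_vs_dense = TOK_PER_S[("RTX 3090", "35B-A3B MoE")] / TOK_PER_S[("RTX 3090", "27B dense")]
gpu_gap = TOK_PER_S[("RTX 4090", "35B-A3B MoE")] / TOK_PER_S[("RTX 3090", "35B-A3B MoE")]
print(f"MoE vs dense on the 3090: {moe_vs_dense:.2f}x")
print(f"4090 vs 3090 on the MoE: {gpu_gap:.2f}x")
```

For a single interactive reply the difference between the two cards is a matter of seconds; it only adds up across many long generations.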

Context length effects: At 32K context the VRAM figures hold. Push to 128K and KV cache adds 4–8 GB. Push to the full 1M-token context window (which 3.7-Max supports in the API) and you're offloading KV cache to system RAM — expect single-digit tok/s regardless of GPU. For home-lab use (coding assistant, document Q&A, local agent), 32K context covers the overwhelming majority of workloads.
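The way KV cache grows with context can be sketched with the standard formula: keys plus values, per cached layer, per KV head, per token. The architecture values below are hypothetical and chosen only to land in the same ballpark as the figures above; they are not Qwen's published configuration, and a real estimate should read the layer and head counts from the model's config.

```python
def kv_cache_gib(tokens, kv_layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in GiB.

    Per token: 2 (keys + values) * layers * KV heads * head dim * bytes per value.
    bytes_per_value=2 corresponds to an FP16 cache.
    """
    per_token_bytes = 2 * kv_layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / 2**30

# HYPOTHETICAL values for illustration only; not taken from any Qwen config.
cfg = dict(kv_layers=16, kv_heads=4, head_dim=256)

for ctx in (32_768, 131_072, 1_048_576):
    print(f"{ctx:>9,} tokens: {kv_cache_gib(ctx, **cfg):5.1f} GiB at FP16")
```

The cache scales linearly with context, so whatever the real per-token cost turns out to be, going from 32K to 1M tokens multiplies it by 32. That is why the full window ends up spilling to system RAM on a 24 GB card.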

Running Qwen 3.6 Now: Your Bridge to 3.7

While the open weights are pending, Qwen 3.6-27B is available on Ollama and runs on the same hardware you'll use for 3.7. A basic setup:

ollama pull qwen3.6:27b-q4_K_M
ollama run qwen3.6:27b-q4_K_M

Expected output on first load on a 24 GB GPU (~16.8 GB VRAM allocated, ~7 GB headroom):

pulling manifest ✓
pulling qwen3.6:27b-q4_K_M... ████████████████ 100%
>>> Send a message
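Once the model is loaded, it can also be queried from a script through Ollama's local HTTP API. The sketch below assumes a default Ollama install listening on port 11434 and reuses the model tag from the commands above; the helper names are illustrative, and the 32K `num_ctx` value follows the context guidance earlier in this article.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def build_request(prompt, model="qwen3.6:27b-q4_K_M", num_ctx=32768):
    """Build a non-streaming generate request with an explicit context size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def tokens_per_second(reply):
    """Generation speed from the reply; eval_duration is in nanoseconds."""
    return reply["eval_count"] * 1e9 / reply["eval_duration"]

def generate(prompt):
    """Send the request. Requires a running Ollama server."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)
```

Calling `reply = generate("Explain MoE routing in two sentences.")` and then printing `reply["response"]` and `tokens_per_second(reply)` gives a quick way to compare your own card against the table above.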

If you hit an out-of-memory error:



error: CUDA out of memory. Tried to allocate 2.50 Gi
