Jovan Chan

Posted on Jun 12 • Originally published at aicoderscope.com

Aider + LM Studio 2026: setup guide, the output-token ceiling that truncates diffs, and which models actually hold up

#aider #lmstudio #localllm #setupguide


TL;DR: Aider v0.86.0 connects to LM Studio 0.4.15's local server with two environment variables and a model prefix, but one silent failure mode will truncate your code diffs mid-function with no error message. This guide sets everything up correctly, fixes that problem before it bites you, and tells you which 2026 coding models are worth loading.

What you'll be able to do after this guide:

Serve a quantized coding model from LM Studio 0.4.15 on http://localhost:1234/v1
Run Aider against it using the correct model prefix and environment variables
Override the output-token ceiling in .aider.model.settings.yml so diffs don't get cut in half
Pick a model matched to your actual VRAM

Honest take: If you already use Ollama and it's working, stay on Ollama — the Aider + Ollama guide covers that path and Ollama's simpler on Apple Silicon and Linux. LM Studio earns the swap on two specific setups: Windows machines where the polished CUDA auto-detection saves you twenty minutes of troubleshooting, and developers with a home desktop GPU they want to serve to a lightweight laptop via LM Link.

Why LM Studio over Ollama for Aider

The Aider + Ollama guide documents the Ollama path. LM Studio earns its own article for specific cases.

Windows-first hardware. Ollama on Windows has improved in 2026 but still needs occasional CUDA path fiddling. LM Studio's Windows installer auto-detects your CUDA version and runtime; most developers are up and serving in under five minutes.

GUI model browser with VRAM estimates. Searching for a model in LM Studio's Discover tab shows GGUF quantization options, the approximate VRAM footprint for each, and whether your GPU can load it. For someone new to quantization levels, seeing "Q4_K_M — 20.1 GB" next to a model removes a lot of guesswork.

Parallel inference since 0.4.0. LM Studio 0.4.0 (February 2026) introduced the llmster daemon for concurrent request processing. Aider can issue rapid sequential tool calls — reading multiple files, applying edits, checking output — and the queuing behavior in older LM Studio versions created visible stalls between steps. With parallel inference on, the gaps shrink.

LM Link for remote GPUs. LM Studio 0.4.15 (May 29, 2026) added end-to-end encrypted remote connections via Tailscale. If you code on a MacBook Pro but have a desktop with a 24 GB GPU at home, you can serve models from the desktop and point Aider at the LM Link address without changing any other configuration.

The tradeoff is real: LM Studio is a several-hundred-MB GUI application. Ollama is a single binary. On a headless server or Apple Silicon Mac, Ollama runs leaner and faster.

Hardware floor

Aider generates complete code edits as diffs, then applies them to your files and auto-commits. That workflow requires the model to hold context across multiple files, understand the existing code structure, and produce syntactically correct diffs. Models in the 7B class regularly fail at this, especially on multi-file edits.

Hardware	Recommended model	Notes
RTX 4060 8 GB	Qwen3-8B Q4_K_M	~5 GB VRAM; single-file edits only, multi-file agentic tasks unreliable
RTX 3060 12 GB	Qwen2.5-Coder-14B Q4_K_M	Minimum practical floor; handles most everyday Aider work
RTX 4060 Ti 16 GB	Qwen3.6-27B Q3_K_M	Good daily driver; slower than 14B but noticeably more coherent on complex refactors
RTX 3090 / RTX 4090 24 GB	Devstral Small 2 Q4_K_M	24B params, 68% SWE-bench Verified; best local option for agentic coding in 2026
Mac M3/M4 unified	Use Ollama + MLX	LM Studio on Apple Silicon exists but Ollama + MLX runs faster for most coding models

Devstral Small 2 (released May 2026 by Mistral AI) is the current ceiling for what runs on consumer hardware — 24B parameters, 68% on SWE-bench Verified, fits on a 24 GB card at Q4_K_M with room for a 32k context window. The RTX 4090 is the practical target GPU for that model. For hardware context, runaihome.com's local AI model by VRAM tier guide covers the tradeoffs in detail.
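As a rough sanity check on the VRAM figures above, the weights-only size of a quantized model can be approximated as parameters times bits per weight, divided by eight. The sketch below assumes about 4.8 bits per weight for Q4_K_M, which is an approximation and not a figure from this guide. KV cache and runtime overhead come on top of the result. The function name estimate_weights_gb is hypothetical.

```shell
# Weights-only estimate: params (billions) * bits per weight / 8 = GB.
# ASSUMPTION: ~4.8 bits/weight for Q4_K_M is an approximation; real GGUF
# files vary, and KV cache plus runtime overhead are not included.
estimate_weights_gb() {
  # $1 = parameters in billions, $2 = average bits per weight
  LC_ALL=C awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_weights_gb 24 4.8   # a 24B model at roughly Q4_K_M; prints 14.4
```

Under that assumption a 24B model needs roughly 14 GB for weights, which is consistent with it fitting on a 24 GB card with room left for context.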

Step 1: Install LM Studio 0.4.15

Download from lmstudio.ai. The current stable release is 0.4.15 (build 2, May 29, 2026). The installer comes as a .exe (Windows), .dmg (macOS), or AppImage/deb (Linux).

On Windows, run the installer — it handles CUDA detection automatically. On Linux:

chmod +x LM-Studio-0.4.15-x86_64.AppImage
./LM-Studio-0.4.15-x86_64.AppImage

Some distributions need the --no-sandbox flag. Try without it first, and add it only if the AppImage does not start:

./LM-Studio-0.4.15-x86_64.AppImage --no-sandbox

LM Studio is free for both personal and commercial use as of 2025 (the policy change was announced on the LM Studio blog — no license form, no paid tier required).

Once installed, go to Settings → Developer Mode and enable it. LM Studio 0.4.0 merged the old Developer and Power User panels; Developer Mode unlocks the server tab and parallel inference settings you'll need next.

Step 2: Download a coding model

Open the Discover tab and search for your model. For a 24 GB card, search devstral-small-2 and select the Q4_K_M GGUF. LM Studio shows the estimated VRAM next to each quantization option.

Alternatively, from the lms CLI that ships with LM Studio 0.4:

# List available models matching a name
lms search devstral

# Download the Q4_K_M quantization interactively
lms get "bartowski/Devstral-Small-2-2506-GGUF"

# Verify it downloaded
lms ls

After download, LM Studio shows the model in your local models list.

Step 3: Start the local server

In LM Studio, go to the Developer tab (visible once Developer Mode is on). Select your loaded model from the dropdown and click Start Server. The default port is 1234.

For parallel inference, open Advanced Server Settings and set Concurrent Request Limit to 4 or more. With a single Aider session, this isn't critical, but it prevents request queuing if you run multiple terminals or additional tooling simultaneously.

Verify the server is running:

curl http://localhost:1234/v1/models

The response lists every loaded model with its exact ID string — you'll need that string in the next step.
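For scripting, the ID strings can be pulled out of that response with standard tools. The helper below is a minimal sketch, assuming the response follows the OpenAI-style shape with one "id" field per model and that IDs contain no escaped quotes. The name extract_model_ids is hypothetical and not part of LM Studio or Aider.

```shell
# Print one model ID per line from an OpenAI-style /v1/models response on stdin.
# ASSUMPTION: each ID appears as "id": "<string>" with no escaped quotes inside.
extract_model_ids() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/.*:[[:space:]]*"\(.*\)"$/\1/'
}

# Usage against the live server (commented out; it needs LM Studio running):
# curl -s http://localhost:1234/v1/models | extract_model_ids
```

A proper JSON parser is more robust than grep and sed if one is installed. This version only avoids the extra dependency.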

Step 4: Connect Aider

Install Aider if you haven't:

pip install aider-chat

The current version is v0.86.0. Set two environment variables before running Aider:

export OPENAI_API_BASE=http://localhost:1234/v1
export OPENAI_API_KEY=lm-studio   # any non-empty string; LM Studio ignores the value on localhost

On Windows (PowerShell):

$env:OPENAI_API_BASE = "http://localhost:1234/v1"
$env:OPENAI_API_KEY  = "lm-studio"

Then run Aider with the openai/ prefix followed by the model ID exactly as /v1/models reports it. The ID below is an example; yours may differ:

aider --model openai/devstral-small-2-2506-GGUF/devstral-small-2-2506-Q4_K_M.gguf

You should see Aider's startup banner and a prompt. If you get "model not found," skip to the model-ID trap section below.
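The connection steps above can be wrapped in a small pre-flight check. The sketch below is one way to do it, assuming only the two variables and the openai/ prefix described in this step. The function name preflight is hypothetical. It prints the command instead of running it so you can inspect it first.

```shell
# Pre-flight sketch: confirm both variables are set, then print the aider command.
# ASSUMPTION: "preflight" is an illustrative helper, not an Aider or LM Studio feature.
preflight() {
  # $1 = model ID exactly as reported by /v1/models
  if [ -z "${OPENAI_API_BASE:-}" ] || [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_BASE and OPENAI_API_KEY must both be set" >&2
    return 1
  fi
  # Aider routes OpenAI-compatible endpoints through the openai/ prefix.
  printf 'aider --model openai/%s\n' "$1"
}
```

Call it as preflight "your-model-id" after exporting the variables. A non-zero exit status means one of them is missing.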

The model-ID trap

This is the first place most setups break. LM Studio generates model IDs that include the full GGUF path — organization, repository name, and quantization suffix:

bartowski/Devstral-Small-2-2506-GGUF/devstral-small-2-2506-Q4_K_M.gguf

The exact string varies by how the model was downloaded and which community packaged it. Get the right string from the live server before constructing your Aider command:


curl -s http://localhost:1234/v1/m
