This article was originally published on aicoderscope.com
TL;DR: Continue.dev + Ollama gives you free, fully local AI coding in both VS Code and JetBrains — the only open-source combo that covers both IDEs. Setup takes 20 minutes. The one trap that breaks most people: Ollama defaults to a 2,048-token context window and silently discards anything beyond it. Fix that before writing a single line of config.
- What you'll be able to do after this guide: run tab autocomplete, chat, and multi-file edits against a local model — no API key, no internet required, no code leaving your machine.
- What you'll need: a GPU with ≥8 GB VRAM (or Apple Silicon with ≥32 GB unified memory), VS Code or any JetBrains IDE, and about 20 minutes.
- Where this setup hits its ceiling: agent tasks that span more than 5 files — at that complexity, cloud models (Claude Sonnet 4.6, GPT-4o) still outperform any local 14B model.
| | Continue.dev + Ollama | Cursor Pro | Cline + Ollama |
|---|---|---|---|
| Best for | VS Code and JetBrains, local-only | Best VS Code agent | VS Code agentic local |
| Price / Cost | $0, no API bill | $20/mo, usage-capped | $0, no API bill |
| The catch | Agent lags Cursor on complex tasks | No local model option at all | VS Code only, no JetBrains |
Honest take: If you're on IntelliJ, PyCharm, or GoLand and need zero cloud, Continue.dev + Ollama is the only serious option. On VS Code, it ties with Cline for local agent work — choose based on whether you want a guided autonomous agent (Cline) or per-role model control with autocomplete (Continue.dev).
Why Local-Only Matters
Most "privacy-first AI coding" guides send your prompts through a relay. Continue.dev + Ollama is different: the VS Code and JetBrains extension is Apache 2.0 open-source, inference runs on your machine, and the BYOK model means there's no Continue-operated server in the request path. If you're working on code under NDA, on a pre-launch product, or under a company policy that prohibits sending source to cloud vendors, this is the setup that actually satisfies those requirements.
The practical check: after setup, pull up your system's network monitor and start a chat. The only connections you'll see are local (localhost:11434). Nothing to Anthropic. Nothing to OpenAI. Nothing to Continue servers. That's verifiable in a way that "we don't train on your data" policy language is not.
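To make that check repeatable, one option is to filter a connection listing for anything whose remote end is not loopback. This is a minimal sketch, assuming `lsof`-style lines that contain `local->remote` address pairs; `non_local_connections` is a hypothetical helper name, and the sample addresses are made up for illustration.

```shell
# Hypothetical helper: keep only connections whose remote end is not loopback.
# Expects lines containing "local->remote" address pairs on stdin.
non_local_connections() {
    grep -F -e '->' | grep -v -F -e '->127.' -e '->[::1]' -e '->localhost'
}

# Illustrative sample input (made-up addresses): one loopback peer, one remote peer.
printf '%s\n' \
    'TCP 127.0.0.1:52100->127.0.0.1:11434 (ESTABLISHED)' \
    'TCP 192.168.1.20:52101->203.0.113.7:443 (ESTABLISHED)' \
    | non_local_connections
# Only the 203.0.113.7 line is printed.
```

On macOS or Linux, live input could come from `lsof -nP -iTCP -sTCP:ESTABLISHED`. Unrelated apps (browsers, sync clients) will show up too, so look specifically for lines belonging to your IDE's process.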
Hardware Floor
The model you can run is bounded by VRAM. Approximate fits for the recommended coding models:
| VRAM / Memory | Recommended model | Realistic use case |
|---|---|---|
| 8 GB VRAM (RTX 4060) | qwen2.5-coder:7b (Q4) | Tab autocomplete only; chat is marginal |
| 12 GB VRAM (RTX 3060 12GB) | qwen2.5-coder:14b (Q4) | Real daily-driver for autocomplete + chat |
| 16 GB VRAM (RTX 4060 Ti 16GB) | qwen2.5-coder:14b (Q5) | Solid local setup |
| 24 GB VRAM (RTX 3090 / RTX 4090) | qwen2.5-coder:32b (Q4) | Best local tier; approaches cloud on single-file tasks |
| 32 GB Apple unified memory (e.g., Mac Studio M2 Max) | qwen2.5-coder:14b comfortably | macOS sweet spot |
| 64 GB+ Apple unified memory | qwen2.5-coder:32b | Best macOS local setup |
The 7B model is tempting because it's fast, but it fails on anything more complex than single-function completions. For chat and edit tasks where Continue.dev shines, 14B is the practical minimum. For a deeper breakdown of which model fits which hardware, our sister site's Best Local AI Models by VRAM tier guide covers the full landscape.
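If you script your setup, the discrete-GPU rows of the table can be turned into a small lookup. This is a sketch only: `recommend_model` and `mib_to_gb` are hypothetical helpers, the thresholds simply mirror the table, and the fallback to the 1.5B model below 8 GB is an assumption rather than a recommendation from the table. Apple unified memory follows different thresholds and is not covered here.

```shell
# Hypothetical helper: round a MiB figure to the nearest whole GB.
# Useful because GPU tools usually report slightly less than the marketed size.
mib_to_gb() {
    echo $(( ($1 + 512) / 1024 ))
}

# Hypothetical helper: map discrete-GPU VRAM (whole GB) to a model tag.
recommend_model() {
    vram_gb=$1
    if [ "$vram_gb" -ge 24 ]; then
        echo "qwen2.5-coder:32b"
    elif [ "$vram_gb" -ge 12 ]; then
        echo "qwen2.5-coder:14b"
    elif [ "$vram_gb" -ge 8 ]; then
        echo "qwen2.5-coder:7b"
    else
        # Assumption: below the table's floor, fall back to the autocomplete-only model.
        echo "qwen2.5-coder:1.5b"
    fi
}

recommend_model 12                      # prints qwen2.5-coder:14b
recommend_model "$(mib_to_gb 24564)"    # prints qwen2.5-coder:32b
```

On an NVIDIA card, the MiB figure could come from `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`. The lookup only picks a model size; the quantization level (Q4 vs. Q5) is still your call.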
Step 1: Install Ollama
The current release at the time of writing is Ollama v0.30.2 (June 3, 2026). To install:
Linux:
curl -fsSL https://ollama.com/install.sh | sh
macOS / Windows: Download the installer from ollama.com/download and run it.
Verify the install and check the service is running:
ollama --version
# Expected output: ollama version is 0.30.2
curl http://localhost:11434/api/tags
# Expected: {"models":[...]} — empty array if no models pulled yet
Ollama runs as a background service on port 11434. On Linux it installs as a systemd service. On macOS it runs as a menu bar app.
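If you want the model names without reading raw JSON, a small filter is enough. This sketch assumes the `/api/tags` response lists each model under a `"name"` key and that no other field uses that key; `list_model_names` is a hypothetical helper, and the sample JSON is trimmed for illustration.

```shell
# Hypothetical helper: print one model name per line from /api/tags JSON on stdin.
# Uses grep/cut so it works without jq; assumes "name" only appears as the model name key.
list_model_names() {
    grep -o '"name": *"[^"]*"' | cut -d'"' -f4
}

# Illustrative, trimmed sample of the response shape.
printf '%s' '{"models":[{"name":"qwen2.5-coder:14b"},{"name":"qwen2.5-coder:1.5b"}]}' \
    | list_model_names
# prints:
# qwen2.5-coder:14b
# qwen2.5-coder:1.5b
```

Against a running server, the same filter can be fed with `curl -fsS http://localhost:11434/api/tags | list_model_names`. No output means no models have been pulled yet.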
Step 2: Pull a Coding Model
Pick based on your VRAM tier from the table above. For the 14B tier:
ollama pull qwen2.5-coder:14b
This downloads approximately 9 GB. Grab a coffee. Verify it arrived:
ollama list
# NAME ID SIZE MODIFIED
# qwen2.5-coder:14b abc123def456 9.0 GB 2 minutes ago
If you want a dedicated autocomplete model (faster, lighter), also pull the 1.5B:
ollama pull qwen2.5-coder:1.5b
# ~1.1 GB — runs on any GPU with 2+ GB VRAM, response time under 300ms
Running a quick test before involving Continue.dev is worth the 30 seconds:
ollama run qwen2.5-coder:14b "Write a Python function to flatten a nested list."
If you get a sensible code response, the model and Ollama are working. Now the trap.
Step 3: Fix the Context Window — Do This First
This is the step that silently ruins most Continue.dev + Ollama setups. Ollama's default context window is 2,048 tokens. For a coding assistant that loads your files into context, that is catastrophic: Continue.dev might send 8,000 tokens of repo context, and Ollama silently discards everything beyond its 2,048-token window. The model has no idea it is missing roughly three quarters of the information. The responses look plausible; they're just wrong.
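The arithmetic behind that loss is simple enough to check for your own prompt sizes. This is a rough sketch: `discarded_pct` is a hypothetical helper, it uses integer math, and it ignores the tokens the model also needs for its reply, so the real loss is slightly worse.

```shell
# Hypothetical helper: percentage of a prompt that does not fit in the context window.
# Integer arithmetic, rounded down.
discarded_pct() {
    prompt_tokens=$1
    num_ctx=$2
    if [ "$prompt_tokens" -le "$num_ctx" ]; then
        echo 0
    else
        echo $(( ($prompt_tokens - $num_ctx) * 100 / $prompt_tokens ))
    fi
}

discarded_pct 8000 2048     # prints 74
discarded_pct 8000 16384    # prints 0
```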
Set the context window before starting your session. The simplest approach is the environment variable:
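# Linux/macOS: set before starting the Ollama server, or export in ~/.bashrc / ~/.zshrc
# The variable must be visible to the Ollama server process itself;
# on a systemd install, set it on the ollama service rather than in your shell.
export OLLAMA_CONTEXT_LENGTH=16384
# Windows (PowerShell; add to your profile for persistence)
$env:OLLAMA_CONTEXT_LENGTH = "16384"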
For a permanent per-model fix that doesn't require an env var, create a Modelfile:
# qwen-coder-ctx.Modelfile
FROM qwen2.5-coder:14b
PARAMETER num_ctx 16384
Then build it as a named local model:
ollama create qwen-coder-ctx -f qwen-coder-ctx.Modelfile
Now reference qwen-coder-ctx in your Continue config instead of the base model. You can verify the context is set:
ollama show qwen-coder-ctx --parameters
# num_ctx 16384 ← this is what you want to see
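The same Modelfile pattern works for the other tiers. A minimal sketch, assuming you only need the `FROM` and `PARAMETER num_ctx` lines shown earlier; `make_ctx_modelfile` is a hypothetical helper, not an Ollama command.

```shell
# Hypothetical helper: print a Modelfile that pins num_ctx for a given base model.
make_ctx_modelfile() {
    base_model=$1
    num_ctx=$2
    printf 'FROM %s\nPARAMETER num_ctx %s\n' "$base_model" "$num_ctx"
}

make_ctx_modelfile qwen2.5-coder:14b 16384
# prints:
# FROM qwen2.5-coder:14b
# PARAMETER num_ctx 16384
```

Redirect the output to a file (for example `make_ctx_modelfile qwen2.5-coder:7b 16384 > qwen-coder-7b-ctx.Modelfile`, a hypothetical file name) and build it with `ollama create` as shown above.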
16,384 tokens is a safe floor for most coding tasks. For larger codebases or long agent sessions, push to 32,768 if your VRAM allows it (roughly 1–2 GB additional usage).
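To sanity-check whether a larger window fits, you can estimate the KV cache yourself: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. The sketch below is a back-of-envelope estimate, not a measurement. The layer count, KV-head count, and head dimension are assumptions for a Qwen2.5 14B-class model, so check them against the model card; `kv_cache_mib` is a hypothetical helper.

```shell
# Hypothetical helper: rough KV-cache size in MiB for a given context length.
# $1 = context length in tokens, $2 = bytes per cache element (2 for f16, about 1 for 8-bit).
kv_cache_mib() {
    tokens=$1
    bytes_per_elem=$2
    layers=48       # assumption: 14B-class model
    kv_heads=8      # assumption: grouped-query attention
    head_dim=128    # assumption
    echo $(( 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1048576 ))
}

kv_cache_mib 16384 2    # prints 3072
kv_cache_mib 32768 2    # prints 6144
kv_cache_mib 32768 1    # prints 3072
```

Under these assumptions, doubling the window from 16,384 to 32,768 adds about 3 GiB at f16 and about 1.5 GiB with an 8-bit cache, so treat any single "extra VRAM" figure as dependent on the model and cache precision.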
Step 4: Install Continue.dev in VS Code
Open VS Code, go to the Extensions panel, search for Continue, and install the extension by Continue Dev, Inc. (2.5 million installs as of May 2026, 33,000 GitHub stars). It will appear as a sidebar panel.
On first launch, Continue prompts you to configure a model. Skip the guided setup — you'll write the config manually in the next step.
Step 5: Install Continue.dev in JetBrains
This is the step no other Continue.dev guide covers specifically, and it's where the setup differs. In any JetBrains IDE (IntelliJ, PyCharm, GoLand, WebStorm, Rider):
- Open Settings → Plugins → Marketplace
- Search for Continue
- Install and restart the IDE
After restart, a Continue panel appears in the right sidebar (look for the Continue icon — a small AI-assist indicator). You can also open it via View → Tool Windows → Continue.
The critical point for JetBrains users: the config.yaml file is shared with VS Code. Both IDEs read from ~/.continue/config.yaml (macOS/Linux) or %USERPROFILE%\.continue\config.yaml (Windows). Configure it once, and the same configuration applies in both IDEs.