Jovan Chan

Posted on • Originally published at aicoderscope.com

Continue.dev + Ollama 2026: local AI coding setup for VS Code and JetBrains with no API key

TL;DR: Continue.dev + Ollama gives you free, fully local AI coding in both VS Code and JetBrains — the only open-source combo that covers both IDEs. Setup takes 20 minutes. The one trap that breaks most people: Ollama defaults to a 2,048-token context window and silently discards anything beyond it. Fix that before writing a single line of config.

  • What you'll be able to do after this guide: run tab autocomplete, chat, and multi-file edits against a local model — no API key, no internet required, no code leaving your machine.
  • What you'll need: a GPU with ≥8 GB VRAM (or Apple Silicon with ≥32 GB unified memory), VS Code or any JetBrains IDE, and about 20 minutes.
  • Where this setup hits its ceiling: agent tasks that span more than 5 files — at that complexity, cloud models (Claude Sonnet 4.6, GPT-4o) still outperform any local 14B model.

| | Continue.dev + Ollama | Cursor Pro | Cline + Ollama |
|---|---|---|---|
| Best for | VS Code and JetBrains, local-only | Best VS Code agent | VS Code agentic local |
| Price / Cost | $0, no API bill | $20/mo, usage-capped | $0, no API bill |
| The catch | Agent lags Cursor on complex tasks | No local model option at all | VS Code only, no JetBrains |

Honest take: If you're on IntelliJ, PyCharm, or GoLand and need zero cloud, Continue.dev + Ollama is the only serious option. On VS Code, it ties with Cline for local agent work — choose based on whether you want a guided autonomous agent (Cline) or per-role model control with autocomplete (Continue.dev).

Why Local-Only Matters

Most "privacy-first AI coding" guides send your prompts through a relay. Continue.dev + Ollama is different: the VS Code and JetBrains extension is Apache 2.0 open-source, inference runs on your machine, and the BYOK model means there's no Continue-operated server in the request path. If you're working on code under NDA, on a pre-launch product, or under a company policy that prohibits sending source to cloud vendors, this is the setup that actually satisfies those requirements.

The practical check: after setup, pull up your system's network monitor and start a chat. The only connections you'll see are local (localhost:11434). Nothing to Anthropic. Nothing to OpenAI. Nothing to Continue servers. That's verifiable in a way that "we don't train on your data" policy language is not.
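
If you would rather script that check than eyeball a network monitor, here is a minimal sketch. The helper name `non_local_peers` is hypothetical and not part of Continue or Ollama. It reads peer addresses from stdin, one `host:port` per line, and prints only the ones that are not loopback. How you collect those addresses depends on your platform.

```shell
# Hypothetical helper: filters a list of "host:port" peers read from stdin
# and prints only the non-loopback ones. Empty output means every peer
# in the list was local.
non_local_peers() {
  grep -Ev '^(127\.[0-9]+\.[0-9]+\.[0-9]+|\[::1\]|localhost):[0-9]+$' || true
}

# Example with literal input:
#   printf '127.0.0.1:11434\n203.0.113.7:443\n' | non_local_peers
```

On Linux, something like `ss -tn state established | awk 'NR>1 {print $NF}' | non_local_peers` feeds it every established TCP peer on the machine. That list includes unrelated applications, so narrow it to the Ollama and IDE processes with your platform's tools before drawing conclusions.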

Hardware Floor

The model you can run is bounded by VRAM. Approximate fits for the recommended coding models:

| VRAM / Memory | Recommended model | Realistic use case |
|---|---|---|
| 8 GB VRAM (RTX 4060) | qwen2.5-coder:7b (Q4) | Tab autocomplete only; chat is marginal |
| 12 GB VRAM (RTX 3060 12GB) | qwen2.5-coder:14b (Q4) | Real daily-driver for autocomplete + chat |
| 16 GB VRAM (RTX 4060 Ti 16GB) | qwen2.5-coder:14b (Q5) | Solid local setup |
| 24 GB VRAM (RTX 3090 / RTX 4090) | qwen2.5-coder:32b (Q4) | Best local tier; approaches cloud on single-file tasks |
| 32 GB Apple unified memory (Mac Studio M3 Ultra) | qwen2.5-coder:14b comfortably | macOS sweet spot |
| 64 GB+ Apple unified memory | qwen2.5-coder:32b | Best macOS local setup |

The 7B model is tempting because it's fast, but it fails on anything more complex than single-function completions. For chat and edit tasks where Continue.dev shines, 14B is the practical minimum. For a deeper breakdown of which model fits which hardware, our sister site's Best Local AI Models by VRAM tier guide covers the full landscape.
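
As a rough sketch, the discrete-GPU rows of the table can be turned into a lookup. The helper name `pick_model` is hypothetical, the thresholds simply mirror the table, and the Apple unified-memory rows are left out. It prints the base tag only; the quantization level is a separate choice.

```shell
# Hypothetical helper: maps discrete-GPU VRAM (whole GB) to the base model
# tag suggested in the table above. Apple unified memory is not covered.
pick_model() {
  local vram_gb="$1"
  if   [ "$vram_gb" -ge 24 ]; then echo "qwen2.5-coder:32b"
  elif [ "$vram_gb" -ge 12 ]; then echo "qwen2.5-coder:14b"
  elif [ "$vram_gb" -ge 8  ]; then echo "qwen2.5-coder:7b"
  else
    # Below the smallest tier listed in the table.
    echo "below the 8 GB floor in the table" >&2
    return 1
  fi
}
```

For example, `pick_model 12` prints `qwen2.5-coder:14b`. Pass the card's nominal capacity; tools that report memory in MiB tend to land slightly under the marketing figure.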

Step 1: Install Ollama

Ollama v0.30.2, released June 3, 2026, is the current version at the time of writing. Install:

Linux:

curl -fsSL https://ollama.com/install.sh | sh

macOS / Windows: Download the installer from ollama.com/download and run it.

Verify the install and check the service is running:

ollama --version
# Expected output: ollama version 0.30.2

curl http://localhost:11434/api/tags
# Expected: {"models":[...]} — empty array if no models pulled yet

Ollama runs as a background service on port 11434. On Linux it installs as a systemd service. On macOS it runs as a menu bar app.
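
Because the service starts in the background, a script that talks to it right after install or boot may get there before it is listening. A minimal sketch of a readiness check follows. The function name `wait_for_ollama` and its defaults are assumptions for illustration. It polls the same `/api/tags` endpoint used above.

```shell
# Hypothetical helper: polls <base_url>/api/tags until it answers or the
# attempts run out. Returns 0 on the first success, 1 otherwise.
wait_for_ollama() {
  local base_url="${1:-http://localhost:11434}"
  local attempts="${2:-10}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: quiet, -f: fail on HTTP errors, --max-time: cap each probe at 2s
    if curl -sf --max-time 2 "$base_url/api/tags" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

Typical use would be `wait_for_ollama || echo "Ollama is not answering on port 11434"` at the top of a setup script.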

Step 2: Pull a Coding Model

Pick based on your VRAM tier from the table above. For the 14B tier:

ollama pull qwen2.5-coder:14b

This downloads approximately 9 GB. Grab a coffee. Verify it arrived:

ollama list
# NAME                    ID              SIZE    MODIFIED
# qwen2.5-coder:14b       abc123def456    9.0 GB  2 minutes ago
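
If you script the setup, the same check can be made without reading the table by eye. This is a small sketch; `has_model` is a hypothetical helper that assumes the column layout shown above, with the tag in the first column and one header row.

```shell
# Hypothetical helper: succeeds if the given tag appears in the NAME column
# of `ollama list`-style output read from stdin (header row skipped).
has_model() {
  awk -v want="$1" 'NR > 1 && $1 == want { found = 1 } END { exit !found }'
}

# Example: pull only when the tag is missing.
#   ollama list | has_model qwen2.5-coder:14b || ollama pull qwen2.5-coder:14b
```

Matching on the exact tag avoids a false positive when a tag such as `qwen2.5-coder:1.5b` is present but `qwen2.5-coder:14b` is not.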

If you want a dedicated autocomplete model (faster, lighter), also pull the 1.5B:

ollama pull qwen2.5-coder:1.5b
# ~1.1 GB — runs on any GPU with 2+ GB VRAM, response time under 300ms

Running a quick test before involving Continue.dev is worth the 30 seconds:

ollama run qwen2.5-coder:14b "Write a Python function to flatten a nested list."

If you get a sensible code response, the model and Ollama are working. Now the trap.

Step 3: Fix the Context Window — Do This First

This is the step that causes most Continue.dev + Ollama setups to produce bad output silently. Ollama's default context window is 2,048 tokens. For a coding assistant that loads your files into context, this is catastrophic: Continue.dev might be sending 8,000 tokens of repo context, and Ollama silently discards everything past token 2,048. The model has no idea it's missing 75% of the information. The responses look plausible — they're just wrong.
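
To put a number on that loss, here is the arithmetic as a small sketch. The helper name `dropped_pct` is hypothetical. It takes the tokens sent and the window size and prints the share of the prompt that falls outside the window. It assumes the token count sent is greater than zero.

```shell
# Hypothetical helper: percentage of a prompt that does not fit in the
# context window. Arguments: tokens sent, window size in tokens.
dropped_pct() {
  awk -v sent="$1" -v window="$2" 'BEGIN {
    dropped = sent - window
    if (dropped < 0) dropped = 0      # everything fits
    printf "%.0f\n", 100 * dropped / sent
  }'
}
```

With the figures above, `dropped_pct 8000 2048` prints `74`, which the paragraph rounds to 75%. At a 16,384-token window the same 8,000-token prompt gives `0`.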

Set the context window before starting your session. The simplest approach is the environment variable:

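# Linux/macOS: the variable must be in the environment of the Ollama server process.
# Exporting it in ~/.bashrc or ~/.zshrc only helps if you start `ollama serve` from that shell.
# On Linux with the systemd service, set it in the unit's environment instead
# (for example via `systemctl edit ollama.service`), then restart the service.
export OLLAMA_CONTEXT_LENGTH=16384

# Windows (PowerShell; add to your profile for persistence)
$env:OLLAMA_CONTEXT_LENGTH = "16384"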

For a permanent per-model fix that doesn't require an env var, create a Modelfile:

# qwen-coder-ctx.Modelfile
FROM qwen2.5-coder:14b
PARAMETER num_ctx 16384

Then build it as a named local model:

ollama create qwen-coder-ctx -f qwen-coder-ctx.Modelfile
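
If you keep several context sizes around, the Modelfile can be generated rather than hand-written. The following is a sketch only; `write_ctx_modelfile` and the `qwen-coder-32k` name in the example are hypothetical. It emits the same two directives shown above.

```shell
# Hypothetical helper: writes a two-line Modelfile that pins num_ctx for a
# base model. Arguments: base tag, context size in tokens, output path.
write_ctx_modelfile() {
  local base="$1" ctx="$2" out="$3"
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$base" "$ctx" > "$out"
}

# Example: a 32,768-token variant under a different local name.
#   write_ctx_modelfile qwen2.5-coder:14b 32768 qwen-coder-32k.Modelfile
#   ollama create qwen-coder-32k -f qwen-coder-32k.Modelfile
```

Each variant is a separate named model in `ollama list`, so you can point different Continue roles at different context sizes.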

Now reference qwen-coder-ctx in your Continue config instead of the base model. You can verify the context is set:

ollama show qwen-coder-ctx --parameters
# num_ctx                        16384   ← this is what you want to see

16,384 tokens is a safe floor for most coding tasks. For larger codebases or long agent sessions, push to 32,768 if your VRAM allows it (roughly 1–2 GB additional usage).

Step 4: Install Continue.dev in VS Code

Open VS Code, go to the Extensions panel, search for Continue, and install the extension by Continue Dev, Inc. (2.5 million installs as of May 2026, 33,000 GitHub stars). It will appear as a sidebar panel.

On first launch, Continue prompts you to configure a model. Skip the guided setup — you'll write the config manually in the next step.

Step 5: Install Continue.dev in JetBrains

This is the step most Continue.dev guides skip, and it's where the setup differs from VS Code. In any JetBrains IDE (IntelliJ, PyCharm, GoLand, WebStorm, Rider):

  1. Open Settings → Plugins → Marketplace
  2. Search for Continue
  3. Install and restart the IDE

After restart, a Continue panel appears in the right sidebar (look for the Continue icon — a small AI-assist indicator). You can also open it via View → Tool Windows → Continue.

The critical point for JetBrains users: the config.yaml file is shared with VS Code. Both IDEs read from ~/.continue/config.yaml (macOS/Linux) or %USERPROFILE%\.continue\config.yaml (Windows). Configure it once, and the sam
