DEV Community

Cover image for Finding the Sweet Spot for Local LLMs: Qwen Coder & Llama.cpp
Dmitry Amelchenko
Dmitry Amelchenko

Posted on • Edited on

Finding the Sweet Spot for Local LLMs: Qwen Coder & Llama.cpp

The Shift to Local Models

Running local LLMs for software development is getting increasingly popular, especially as commercial providers continue to charge by the token. It finally makes economic sense to run models locally to avoid cost overruns.

I have personally spent a lot of time trying to figure out the best configuration. After experimenting with LM Studio, Ollama, and RooCode, I finally found a setup that consistently works for my workflow: Llama.cpp running Qwen Coder via GitHub Copilot with OpenSpec SSD.

Here is a breakdown of my experience and the exact configuration I use.

The Hardware Reality

To get decent results locally, hardware is the primary constraint. I was fortunate enough to recently purchase the latest MacBook Pro M5 with 128GB of RAM.

Initially, I had some buyer's remorse spending that much on a machine, but it has proven essential. I tend to consume a lot of memory — I regularly run VS Code with multiple workspaces, React Native servers and simulators, a mail client, and around 100 Google Chrome tabs simultaneously.

Even with all of this running alongside the local LLM, my system rarely swaps more than 2GB of memory to the disk. Performance stays smooth, and I avoid the severe degradation that happens when swap usage climbs higher.

The Software Stack

While wrappers like Ollama and LM Studio are convenient, I found the best results come from running Llama.cpp directly.

Installation on macOS is straightforward:

brew install llama.cpp
Enter fullscreen mode Exit fullscreen mode

For the models, Hugging Face is the best source.
To use Llama.cpp models from Hugging Face, you need files in the GGUF format. These models are optimized for local inference on both CPUs and GPUs. https://huggingface.co/docs/hub/en/gguf-llamacpp

My model of choice is Qwen3.6-35B-A3B-MTP-GGUF:UD-Q8_K_XL.
Let's brake it down what this name stands for:

  • 35B: The model has 35 billion total parameters (the size of its "brain").
  • A3B: It is a "Sparse Mixture-of-Experts" (MoE) architecture. Instead of using all 35B parameters to answer a question, it dynamically activates only 3 billion "active" parameters per token. This delivers massive speed and efficiency without sacrificing intelligence.
  • MTP: Multi-Token Prediction. The model is trained to predict multiple tokens (words) at once rather than one-by-one, significantly accelerating generation speeds during inference.
  • GGUF: Generalized GPU-CPU Fusion. A popular file format used for running AI models locally. It allows you to split the model between your graphics card (VRAM) and your computer's regular system memory (RAM).
  • UD: A specialized quantization method developed by the Unsloth team designed to preserve maximum intelligence at lower file sizes.
  • Q8_K_XL: The specific level of quantization (compression).Q8 means it is an 8-bit quantization.It aggressively reduces file size compared to the original, while still maintaining extremely high quality (nearly matching the original uncompressed model).XL indicates a specific weighting adjustment meant for Unsloth's extra-large context-size handling.

The Quantization Goldilocks Zone

Quantization makes a massive difference in performance and stability. I went through quite a bit of trial and error to find the right balance:

  • 4-bit: I tried this first, but it lacked precision. For complex coding tasks, the model would frequently get stuck in infinite loops (pretty much always for me).
  • 16-bit: I attempted to run the 32B parameter model at 16-bit, but it was simply too large for my hardware to handle.
  • 8-bit: This was the sweet spot. It fits within my memory constraints while executing complex reasoning flawlessly. To download and run the server with the given model on your local environment, run:
llama-server -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q8_K_XL
Enter fullscreen mode Exit fullscreen mode

When you installed llama.cpp via brew, you can run the above command, which will take a while the first time -- depending on your internet speed, up to an hour or so. Running it again will access the cached version, so it will take only a few seconds to load into the memory and start. You can check if it's running by going to http://localhost:8080/

My Copilot Configuration

If you run a model locally, GitHub Copilot does not charge you for tokens (not yet, anyways), meaning you can stay on the standard plan while running complex code analysis.

In VSCode, copilot chat window, click on the model picker and select a gear next to the "Other Models":

Then select "add models" and "custom endpoint",
type "llama.cpp" for group name, hit "enter" for the API key, and "Enter" again for the API type (the value does not really matter).

After you save the initial config, you should be able to see the llama.cpp in the list of models when you click the gear next to the "Other Models" selection again. Then, click a gear next to "llama.cpp" and open the config as JSON. To save you time -- here is the final JSON I have, you may want to copy and paste it in your config:

    {
        "name": "llama.cpp",
        "vendor": "customendpoint",
        "apiKey": "${input:chat.lm.secret.6d112807}",
        "models": [
            {
                "id": "Local Llama, qwen-Q8_0",
                "name": "qwen-Q8_0",
                "url": "http://localhost:8080",
                "toolCalling": true,
                "vision": false,
                "reasoning": true,
                "thinking": true,
                "maxInputTokens": 131072,
                "maxOutputTokens": 131072,
                "contextWindowSize": 262144,
                "parameters": {
                    "top_k": 20,
                    "top_p": 0.95,
                    "min_p": 0.0,
                    "repetition_penalty": 1.0,
                    "temperature": 0.6,
                    "max_new_tokens": 1500,
                    "num_ctx": 16384,
                    "num_gpu": -1,
                    "num_thread": 12
                }
            }
        ]
    },
Enter fullscreen mode Exit fullscreen mode

Let's go over some of these parameters:

  • Endpoint: localhost:8080 (or your specific local port)
  • Tool Calling: true
  • Vision: false (I tried enabling this, but Qwen kept returning errors that vision is unsupported)
  • Reasoning & Thinking: true
  • Context Window: 256k total (128k max input / 128k max output)
  • Temperature: 0.6
  • GPU Offload (num_gpu): -1 (Uses all available GPUs)
  • Threads: 12 (Maps to the performance cores on my Mac)

A Note on Reasoning and Temperature

Some documentation suggests turning "Reasoning" and "Thinking" off for coding tasks, but my experiments proved otherwise. I use Spec-Driven Development with OpenSpec, which requires heavy analysis and planning before any code is written. Leaving reasoning set to true yielded significantly better results for this workflow.

Additionally, keep your temperature at 0.6 rather than strict 0.0. A purely deterministic 0.0 temperature can cause the model to get permanently stuck if it hits a logic loop. That bump gives it just enough variance to diverge and find a solution.

Few words on agentic coding

No matter the model you use -- Vibe coding, as we know it, is always going to suffer from the Architecture Entropy. Read more on this topic here in this post The End of Vibe Coding
This is why Spec-Driven Development (SDD) is a must.
Also, on the topic of why it's essential to keep your Architecture "as simple as possible, but not simpler", read the following post Why GenAI Billing Makes Minimalist Architecture Mandatory

The Payoff

The results I am getting from this local Qwen setup are remarkably close to top-tier remote models like Claude Opus 4.6 .

Between the high-quality output and the fact that I am saving a couple of hundred dollars a month on API costs, the heavy upfront investment in the MacBook Pro will pay for itself within a couple of years. If you have the hardware, I highly recommend giving this stack a try.

Top comments (3)

Collapse
 
dmitryame profile image
Dmitry Amelchenko

After doing much experimentation and finetunning, as of today the best performance and speed I was able to get from the following model:

llama-server -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q8_K_XL

Enter fullscreen mode Exit fullscreen mode

I also updated the article to reflect some tweaks.

Collapse
 
muthuraj_91 profile image
Muthu Raj • Edited

Hi Dmitry, the setup sounds impressive. I am curious whether you have compared this local setup against cloud-hosted coding assistants for real development workflows. Are there any tasks where the local models consistently outperform cloud-based alternatives, beyond privacy and cost considerations?

Collapse
 
dmitryame profile image
Dmitry Amelchenko • Edited

Hi Muthu. Cloud hosted coding will always be faster and better -- it's just a totally different level of resources available in the cloud. Though, running it locally gives you a lot more flexibility for fine tuning parameters for a specific task you need to execute. Scarcity usually drives more creative approach. With cloud based agentic coding we typically take what's offered to us by the provider -- running things locally we are forced to actually understand which parameters to tweak to drive different outcomes. For instance MTP predicts multiple steps ahead, which is superior for complex architecture analysis, but tends to be very slow (if you can afford to wait). Q8 models typically require significantly less memory than BF16, but slightly less accurate. You can also play around with context size etc... So it's totally up to you to pick what works best for you.