You're running AI in production. Things are going well — users love it, the team is shipping features, and then the invoice arrives.
$2,400/day on API calls. For what started as "a few GPT-4 calls here and there."
I've been there. Running a multi-agent system where every task — from simple text classification to complex reasoning — was hitting Claude Opus or GPT-4. The quality was great. The bill was not.
Over three months, I got that $2,400/day down to ~$700/day with no measurable quality loss on 94% of tasks. Here's exactly how.
The Cost Problem: Let's Talk Numbers
First, let's ground this in reality. Current pricing (as of early 2026):
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| Claude Opus 4 | $15.00 | $75.00 |
| Claude Sonnet 4 | $3.00 | $15.00 |
| Claude Haiku | $0.25 | $1.25 |
| GPT-4o mini | $0.15 | $0.60 |
| Qwen 2.5 7B (local) | $0.00 | $0.00* |
*Hardware costs apply, but if you already have a GPU sitting around, marginal cost is electricity.
A single Claude Opus call with a typical system prompt (~2K tokens) plus context (~3K tokens) generating a ~1K token response costs roughly $0.15: about $0.075 for the input and another $0.075 for the output at the prices above. That sounds tiny until you're making 25,000 calls a day.
The maths is simple but brutal: most production AI systems are running their most expensive model on tasks that don't need it.
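To make that arithmetic concrete, here's the back-of-the-envelope version using the Opus prices from the table (a sketch, not a billing tool):

```python
# Back-of-the-envelope cost of one Claude Opus 4 call (prices from the table above)
OPUS_INPUT_PER_M = 15.00    # USD per 1M input tokens
OPUS_OUTPUT_PER_M = 75.00   # USD per 1M output tokens

input_tokens = 2_000 + 3_000   # system prompt + context
output_tokens = 1_000

cost = (input_tokens / 1_000_000) * OPUS_INPUT_PER_M \
     + (output_tokens / 1_000_000) * OPUS_OUTPUT_PER_M
print(f"${cost:.3f} per call")   # ~$0.15
```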
Strategy 1: Model Routing — The Biggest Win
This single change cut my costs by ~45%.
The idea: classify incoming tasks by complexity, then route to the appropriate model. Not every question needs a PhD — some just need a lookup.
```python
# Simplified routing logic
def route_request(task: dict) -> str:
    complexity = estimate_complexity(task)
    if complexity == "simple":
        # Classification, extraction, formatting, simple Q&A
        return "gpt-4o-mini"  # or local model
    elif complexity == "medium":
        # Summarisation, code review, standard generation
        return "claude-sonnet-4-20250514"
    else:
        # Complex reasoning, multi-step analysis, creative work
        return "claude-opus-4-20250514"
```
The complexity estimator doesn't need to be fancy. In my case, a simple heuristic based on task type, input length, and whether the task requires multi-step reasoning got me 90% of the way there. You can even use a cheap model to classify — a GPT-4o mini call to decide routing costs fractions of a cent.
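For what it's worth, the heuristic can be as blunt as the sketch below. The task types, token thresholds, and field names are illustrative, not the exact ones I run in production:

```python
# Illustrative complexity heuristic; tune task types and thresholds to your traffic
SIMPLE_TASKS = {"classification", "extraction", "formatting", "template_fill"}
MEDIUM_TASKS = {"summarisation", "code_review", "content_generation"}

def estimate_complexity(task: dict) -> str:
    task_type = task.get("type", "unknown")
    input_tokens = task.get("input_tokens", 0)
    multi_step = task.get("requires_multi_step_reasoning", False)

    if multi_step or input_tokens > 20_000:
        return "complex"
    if task_type in SIMPLE_TASKS and input_tokens < 2_000:
        return "simple"
    if task_type in MEDIUM_TASKS or input_tokens < 8_000:
        return "medium"
    return "complex"
```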
What I found in practice:
- ~60% of requests were "simple" — classification, entity extraction, formatting, template filling
- ~25% were "medium" — summarisation, standard content generation, code explanation
- ~15% actually needed top-tier reasoning
That means roughly 60% of my requests were hitting a model around 100x more expensive than they needed.
Strategy 2: Fallback Chains
Model routing handles the happy path. Fallback chains handle everything else — rate limits, outages, and cost control.
```
Primary: Claude Opus 4
  ↓ (rate limited or timeout)
Secondary: Claude Sonnet 4
  ↓ (API down or budget exceeded)
Tertiary: Local Qwen 2.5 7B via Ollama
```
I use LiteLLM as the routing layer. It gives you a unified OpenAI-compatible API across providers with built-in fallbacks, retries, and spend tracking.
```yaml
# litellm config
model_list:
  - model_name: reasoning-heavy
    litellm_params:
      model: anthropic/claude-opus-4-20250514
      max_budget: 500  # daily cap in USD
  - model_name: reasoning-heavy
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514  # fallback
  - model_name: simple-tasks
    litellm_params:
      model: ollama/qwen2.5:7b
      api_base: http://localhost:11434
```
The daily budget cap is crucial. Once your primary model hits spend limits, requests automatically fall through to cheaper alternatives. You get cost predictability without building it yourself.
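LiteLLM's router does the fallback handling for you, but the underlying pattern is worth seeing once. A minimal sketch, using litellm.completion directly with the chain above hard-coded (error handling simplified; the Ollama entry assumes the default localhost endpoint):

```python
import litellm

FALLBACK_CHAIN = [
    "anthropic/claude-opus-4-20250514",
    "anthropic/claude-sonnet-4-20250514",
    "ollama/qwen2.5:7b",   # assumes Ollama on its default http://localhost:11434
]

def complete_with_fallback(messages: list[dict], **kwargs):
    last_error = None
    for model in FALLBACK_CHAIN:
        try:
            return litellm.completion(model=model, messages=messages, **kwargs)
        except Exception as err:   # rate limits, timeouts, provider outages
            last_error = err
    raise RuntimeError("every model in the fallback chain failed") from last_error
```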
Strategy 3: Smart Caching
You'd be surprised how many "unique" requests are actually near-duplicates.
Exact caching
The low-hanging fruit. Hash the prompt, cache the response. If someone asks the same question twice, don't pay twice. I use Redis with a 24-hour TTL. This alone saved ~8% on costs.
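For reference, the exact cache really is only a few lines. This sketch assumes a local Redis instance and keys on a hash of the model name plus the full message list:

```python
import hashlib
import json
import redis

r = redis.Redis()                  # assumes a local Redis instance
CACHE_TTL_SECONDS = 24 * 60 * 60   # 24-hour TTL

def cache_key(model: str, messages: list[dict]) -> str:
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return "llm:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(model: str, messages: list[dict], call_llm) -> str:
    key = cache_key(model, messages)
    hit = r.get(key)
    if hit is not None:
        return hit.decode()        # cache hit: no API call, no cost
    response = call_llm(model, messages)
    r.setex(key, CACHE_TTL_SECONDS, response)
    return response
```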
Semantic caching
More interesting. Use embeddings to find semantically similar previous queries and return cached results if similarity is above a threshold (I use 0.95).
```python
# Pseudocode — semantic cache lookup
query_embedding = embed(new_query)
cached = vector_store.search(query_embedding, threshold=0.95)
if cached:
    return cached.response  # free
```
Be conservative with the threshold. A 0.90 threshold sounds close but will serve wrong answers. I learned this the hard way with customer-facing responses.
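Fleshed out a little, the whole thing is cosine similarity over stored embeddings. The sketch below uses sentence-transformers purely as an example; any embedding model and vector store will do, and the threshold needs tuning per task:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # example model only
THRESHOLD = 0.95

cached_queries: list[np.ndarray] = []   # unit-normalised query embeddings
cached_responses: list[str] = []

def semantic_lookup(query: str) -> str | None:
    if not cached_queries:
        return None
    vec = encoder.encode(query, normalize_embeddings=True)
    sims = np.stack(cached_queries) @ vec   # cosine similarity on unit vectors
    best = int(np.argmax(sims))
    return cached_responses[best] if sims[best] >= THRESHOLD else None

def semantic_store(query: str, response: str) -> None:
    cached_queries.append(encoder.encode(query, normalize_embeddings=True))
    cached_responses.append(response)
```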
Prompt caching (provider-level)
Anthropic and OpenAI both offer prompt caching now. If your system prompt is the same across calls (and it should be), Anthropic charges 90% less for cached input tokens and OpenAI 50% less. For a 2K-token system prompt across 25K daily calls, that's meaningful — roughly $700/month saved just from system prompt caching on Sonnet.
Enable it. It's almost free money.
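On the Anthropic side, enabling it amounts to marking the static part of the prompt as cacheable. A rough sketch (check the current API docs for exact fields; the system prompt and model name here are placeholders):

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "..."            # your real ~2K-token system prompt, identical on every call

def ask(user_input: str):
    return client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            # marks the static prefix as cacheable; cache reads cost ~10% of normal input
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": user_input}],
    )
```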
Strategy 4: Prompt Engineering for Cost
Every token costs money. Treat your prompts like code — review them, optimise them, measure them.
What I changed:
Trimmed system prompts. My original system prompts were 3,000+ tokens of "be helpful, be accurate, consider edge cases..." I cut them to ~800 tokens with no quality difference. The models already know how to be helpful.
Stopped sending full conversation history. Instead of 20 messages of context, I send a summary of the conversation plus the last 3 messages. For a chatbot doing 10+ turns, this cuts input tokens by 60%.
Structured output requests. Instead of asking the model to explain its reasoning and then give an answer, I ask for JSON output directly. Shorter outputs = lower cost.
Removed redundant instructions. "Please respond in English" when the input is in English. "Be concise" followed by "provide a detailed explanation." Audit your prompts for contradictions and waste.
```python
# Before: ~3200 input tokens
system = """You are a helpful AI assistant. You should always be accurate
and provide detailed, well-structured responses. Consider edge cases.
Be polite. Format your response clearly. If you're unsure, say so.
Always respond in the same language as the user. Consider the context
of the conversation carefully before responding..."""

# After: ~600 input tokens
system = """Extract entities from user text. Return JSON:
{"entities": [{"text": str, "type": str, "confidence": float}]}
No explanation needed."""
```
Same task, 80% fewer tokens in, 90% fewer tokens out.
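On the conversation-history point, the trimming itself is trivial; the only extra LLM call is an occasional summary refresh, which can go to whatever cheap model you route such tasks to. A minimal sketch (the summarise callable is a placeholder):

```python
LAST_N = 3   # recent messages to keep verbatim

def build_context(history: list[dict], summarise) -> list[dict]:
    """Replace old turns with a rolling summary plus the last few messages."""
    if len(history) <= LAST_N:
        return history
    older, recent = history[:-LAST_N], history[-LAST_N:]
    summary = summarise(older)   # one cheap-model call, refreshed as the chat grows
    return [{"role": "system", "content": f"Conversation so far: {summary}"}] + recent
```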
Strategy 5: Local Models — When They're Good Enough
I run Qwen 2.5 7B on an RTX 3060 via Ollama. It costs nothing per request and handles more than you'd think.
Where local models work well (7B–14B range):
- Text classification: ~92% accuracy vs ~97% for Opus (good enough for routing)
- Entity extraction: ~89% accuracy on my benchmark set
- Reformatting/templating: essentially identical to cloud models
- Simple Q&A over provided context: solid when context is clean
Where they fall apart:
- Multi-step reasoning over large contexts
- Nuanced creative writing
- Complex code generation (fine for simple scripts)
- Anything requiring broad world knowledge
The key insight: for many production tasks, 92% accuracy is fine. If you're classifying support tickets or extracting dates from emails, you don't need Claude Opus. You need something fast, cheap, and good enough.
Running local also gives you zero-latency calls (no network round-trip), full data privacy, and no rate limits. For high-throughput pipelines, this matters as much as cost.
Practical setup with Ollama:
```bash
# Install Ollama (https://ollama.com), then pull the model
ollama pull qwen2.5:7b
# The Ollama server listens on :11434 and exposes an OpenAI-compatible /v1 API
# Point LiteLLM at it (api_base: http://localhost:11434) — done
```
For higher throughput, look at vLLM — it handles concurrent requests with continuous batching far better than Ollama.
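If you do outgrow Ollama, recent vLLM versions ship an OpenAI-compatible server as a one-liner (model name shown as an example; it listens on port 8000 by default):

```bash
pip install vllm
# Serves an OpenAI-compatible API on :8000 by default
vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 8192
```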
Real-World Results: The Multi-Agent System
Here's the before/after on my system — a multi-agent setup handling document processing, customer queries, and internal tooling:
Before (everything on Claude Opus):
- ~25,000 calls/day
- Average 6K tokens per call (in+out)
- ~$2,400/day → ~$72,000/month
After (routed + cached + optimised):
- 60% routed to GPT-4o mini or local models: -$1,400/day
- 25% routed to Sonnet instead of Opus: -$200/day
- Caching (exact + semantic) eliminated ~12% of calls: -$100/day
- Prompt optimisation reduced average tokens by ~35%, with the savings spread across all tiers
Result: ~$700/day → ~$21,000/month
That's a 71% reduction. The quality metrics I track (task completion rate, user satisfaction scores, accuracy on a held-out test set) showed less than 2% degradation overall. The 15% of tasks still hitting Opus actually got better because I could afford to give them more context and retries.
The 80/20 Rule
If you take one thing from this post: audit your model usage before optimising anything else.
In almost every system I've seen, 80% of LLM calls are simple tasks running on expensive models because that's what was easiest to set up during development. Nobody goes back to optimise the model choice because it works fine — until the bill arrives.
Start here:
- Log every LLM call with model, token count, task type, and cost (see the sketch below)
- Classify tasks by actual complexity needed
- Route the simple stuff to cheap/local models
- Enable prompt caching (literally a config flag)
- Trim your prompts — most are 2-3x longer than needed
Steps 1-3 will get you 50-60% of the savings. The rest is optimisation on top.
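Step 1 is the one people skip, and it's the foundation for everything else. A minimal logging wrapper might look like the sketch below; the price table and the print() call are placeholders for whatever pricing data and metrics backend you actually use:

```python
import time
import litellm

# Illustrative prices (USD per 1M input/output tokens); keep in sync with your providers
PRICES = {
    "claude-opus-4-20250514": (15.00, 75.00),
    "claude-sonnet-4-20250514": (3.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def logged_completion(model: str, messages: list[dict], task_type: str, **kwargs):
    start = time.time()
    response = litellm.completion(model=model, messages=messages, **kwargs)
    usage = response.usage
    in_price, out_price = PRICES.get(model, (0.0, 0.0))
    cost = (usage.prompt_tokens / 1e6) * in_price + (usage.completion_tokens / 1e6) * out_price
    # Swap print() for your metrics pipeline (a DB table, Prometheus, etc.)
    print({
        "model": model,
        "task_type": task_type,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "cost_usd": round(cost, 5),
        "latency_s": round(time.time() - start, 2),
    })
    return response
```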
Tools Worth Knowing
- LiteLLM: Unified API gateway, model routing, spend tracking, fallbacks. The single most useful tool for multi-model setups.
- Ollama: Dead-simple local model serving. Pull a model, run it, done.
- vLLM: Production-grade local inference with proper batching. Use when Ollama isn't enough.
- OpenRouter: Single API for 100+ models with automatic fallbacks and cost comparison.
Running LLMs in production doesn't have to mean choosing between quality and cost. It means being intentional about which model handles which task — the same way you wouldn't use a GPU instance to serve static files.
The expensive model should be your scalpel, not your hammer.
Got questions or want to share your own cost optimisation stories? Drop a comment — I'd love to hear what's worked for you.
Top comments (4)
Solid breakdown. The model routing strategy resonates a lot — I'm running a similar setup for a financial data pipeline (processing SEC 13F filings) and found nearly identical distribution: ~60% of tasks are simple extraction/classification that a small model handles perfectly.
One thing I'd add: semantic caching gets even more powerful when you combine it with structured output schemas. If you know the output format upfront (like your entity extraction example), you can cache at the schema level and get much higher hit rates than raw prompt similarity.
The LiteLLM recommendation is spot on. We switched to it about a month ago and the unified spend tracking alone was worth it — being able to see cost-per-task-type in a dashboard changed how we think about prompt design.
Curious about your experience with the 0.95 similarity threshold — have you tried adaptive thresholds based on task type? For classification tasks I've found 0.92 works fine, while for generation tasks you need 0.97+.
Thanks Vic - the schema-level caching idea is genuinely clever. We're doing semantic similarity on raw prompts but caching at the output schema level would definitely boost hit rates for structured extraction tasks. Going to experiment with that.
On adaptive thresholds - I haven't tried per-task-type tuning yet, but your numbers make intuitive sense. Classification is inherently more forgiving (there's only so many ways to say "this is a billing query"), while generation needs tighter matching or you get uncanny near-misses. I'll run some A/B tests with 0.92/0.97 splits and see what the cache hit rates look like.
Cheers for the SEC use case validation - always good to see the distribution holding up across different domains.
Really glad the schema-level caching idea resonated! We actually stumbled onto it by accident — our SEC filing parser was generating nearly identical structured outputs for similar document types, and caching at the raw prompt level was missing all those hits.
The 0.92/0.97 split has been working well for us in production. One thing I'd add: we also found that maintaining a small "golden set" of cached outputs per task type helps catch regression when you tune thresholds. Basically a handful of known-good input/output pairs that you validate against whenever you adjust the similarity cutoff.
Would love to hear how your A/B tests turn out — especially curious if the hit rate improvement holds up at scale with more diverse prompt distributions.
Great insights on reducing costs in production AI systems! Your routing strategy is crucial; many teams overlook how fine-tuning model selection can drastically cut expenses. Have you considered integrating a similar market intelligence approach into your model routing? This could involve adjusting the routing weights dynamically based on real-time spend analysis or fluctuations in model performance. I’d love to see how you balance between model accuracy and cost efficiency in long-term use cases! 🔍