You're running AI in production. Things are going well — users love it, the team is shipping features, and then the invoice arrives.
$2,400/day on API calls. For what started as "a few GPT-4 calls here and there."
I've been there. Running a multi-agent system where every task — from simple text classification to complex reasoning — was hitting Claude Opus or GPT-4. The quality was great. The bill was not.
Over three months, I got that $2,400/day down to ~$700/day with no measurable quality loss on 94% of tasks. Here's exactly how.
The Cost Problem: Let's Talk Numbers
First, let's ground this in reality. Current pricing (as of early 2026):
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| Claude Opus 4 | $15.00 | $75.00 |
| Claude Sonnet 4 | $3.00 | $15.00 |
| Claude Haiku | $0.25 | $1.25 |
| GPT-4o mini | $0.15 | $0.60 |
| Qwen 2.5 7B (local) | $0.00 | $0.00* |
*Hardware costs apply, but if you already have a GPU sitting around, marginal cost is electricity.
A single Claude Opus call with a typical system prompt (~2K tokens) plus context (~3K tokens) generating a ~1K token response costs roughly $0.15: about $0.075 for the input and another $0.075 for the output at the prices above. That sounds tiny until you're making 25,000 calls a day.
The maths is simple but brutal: most production AI systems are running their most expensive model on tasks that don't need it.
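To make that arithmetic concrete, here's the back-of-the-envelope version using the Opus prices from the table (a sketch, not a billing tool):

```python
# Back-of-the-envelope cost of one Claude Opus 4 call (prices from the table above)
OPUS_INPUT_PER_M = 15.00    # USD per 1M input tokens
OPUS_OUTPUT_PER_M = 75.00   # USD per 1M output tokens

input_tokens = 2_000 + 3_000   # system prompt + context
output_tokens = 1_000

cost = (input_tokens / 1_000_000) * OPUS_INPUT_PER_M \
     + (output_tokens / 1_000_000) * OPUS_OUTPUT_PER_M
print(f"${cost:.3f} per call")   # ~$0.15
```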
Strategy 1: Model Routing — The Biggest Win
This single change cut my costs by ~45%.
The idea: classify incoming tasks by complexity, then route to the appropriate model. Not every question needs a PhD — some just need a lookup.
```python
# Simplified routing logic
def route_request(task: dict) -> str:
    complexity = estimate_complexity(task)
    if complexity == "simple":
        # Classification, extraction, formatting, simple Q&A
        return "gpt-4o-mini"  # or local model
    elif complexity == "medium":
        # Summarisation, code review, standard generation
        return "claude-sonnet-4-20250514"
    else:
        # Complex reasoning, multi-step analysis, creative work
        return "claude-opus-4-20250514"
```
The complexity estimator doesn't need to be fancy. In my case, a simple heuristic based on task type, input length, and whether the task requires multi-step reasoning got me 90% of the way there. You can even use a cheap model to classify — a GPT-4o mini call to decide routing costs fractions of a cent.
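For what it's worth, the heuristic can be as blunt as the sketch below. The task types, token thresholds, and field names are illustrative, not the exact ones I run in production:

```python
# Illustrative complexity heuristic; tune task types and thresholds to your traffic
SIMPLE_TASKS = {"classification", "extraction", "formatting", "template_fill"}
MEDIUM_TASKS = {"summarisation", "code_review", "content_generation"}

def estimate_complexity(task: dict) -> str:
    task_type = task.get("type", "unknown")
    input_tokens = task.get("input_tokens", 0)
    multi_step = task.get("requires_multi_step_reasoning", False)

    if multi_step or input_tokens > 20_000:
        return "complex"
    if task_type in SIMPLE_TASKS and input_tokens < 2_000:
        return "simple"
    if task_type in MEDIUM_TASKS or input_tokens < 8_000:
        return "medium"
    return "complex"
```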
What I found in practice:
- ~60% of requests were "simple" — classification, entity extraction, formatting, template filling
- ~25% were "medium" — summarisation, standard content generation, code explanation
- ~15% actually needed top-tier reasoning
That means roughly 60% of my requests were hitting a model around 100x more expensive than they needed.
Strategy 2: Fallback Chains
Model routing handles the happy path. Fallback chains handle everything else — rate limits, outages, and cost control.
```
Primary: Claude Opus 4
  ↓ (rate limited or timeout)
Secondary: Claude Sonnet 4
  ↓ (API down or budget exceeded)
Tertiary: Local Qwen 2.5 7B via Ollama
```
I use LiteLLM as the routing layer. It gives you a unified OpenAI-compatible API across providers with built-in fallbacks, retries, and spend tracking.
```yaml
# litellm config
model_list:
  - model_name: reasoning-heavy
    litellm_params:
      model: anthropic/claude-opus-4-20250514
      max_budget: 500  # daily cap in USD
  - model_name: reasoning-heavy
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514  # fallback
  - model_name: simple-tasks
    litellm_params:
      model: ollama/qwen2.5:7b
      api_base: http://localhost:11434
```
The daily budget cap is crucial. Once your primary model hits spend limits, requests automatically fall through to cheaper alternatives. You get cost predictability without building it yourself.
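LiteLLM's router does the fallback handling for you, but the underlying pattern is worth seeing once. A minimal sketch, using litellm.completion directly with the chain above hard-coded (error handling simplified; the Ollama entry assumes the default localhost endpoint):

```python
import litellm

FALLBACK_CHAIN = [
    "anthropic/claude-opus-4-20250514",
    "anthropic/claude-sonnet-4-20250514",
    "ollama/qwen2.5:7b",   # assumes Ollama on its default http://localhost:11434
]

def complete_with_fallback(messages: list[dict], **kwargs):
    last_error = None
    for model in FALLBACK_CHAIN:
        try:
            return litellm.completion(model=model, messages=messages, **kwargs)
        except Exception as err:   # rate limits, timeouts, provider outages
            last_error = err
    raise RuntimeError("every model in the fallback chain failed") from last_error
```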
Strategy 3: Smart Caching
You'd be surprised how many "unique" requests are actually near-duplicates.
Exact caching
The low-hanging fruit. Hash the prompt, cache the response. If someone asks the same question twice, don't pay twice. I use Redis with a 24-hour TTL. This alone saved ~8% on costs.
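For reference, the exact cache really is only a few lines. This sketch assumes a local Redis instance and keys on a hash of the model name plus the full message list:

```python
import hashlib
import json
import redis

r = redis.Redis()                  # assumes a local Redis instance
CACHE_TTL_SECONDS = 24 * 60 * 60   # 24-hour TTL

def cache_key(model: str, messages: list[dict]) -> str:
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return "llm:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(model: str, messages: list[dict], call_llm) -> str:
    key = cache_key(model, messages)
    hit = r.get(key)
    if hit is not None:
        return hit.decode()        # cache hit: no API call, no cost
    response = call_llm(model, messages)
    r.setex(key, CACHE_TTL_SECONDS, response)
    return response
```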
Semantic caching
More interesting. Use embeddings to find semantically similar previous queries and return cached results if similarity is above a threshold (I use 0.95).
```python
# Pseudocode — semantic cache lookup
query_embedding = embed(new_query)
cached = vector_store.search(query_embedding, threshold=0.95)
if cached:
    return cached.response  # free
```
Be conservative with the threshold. A 0.90 threshold sounds close but will serve wrong answers. I learned this the hard way with customer-facing responses.
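Fleshed out a little, the whole thing is cosine similarity over stored embeddings. The sketch below uses sentence-transformers purely as an example; any embedding model and vector store will do, and the threshold needs tuning per task:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # example model only
THRESHOLD = 0.95

cached_queries: list[np.ndarray] = []   # unit-normalised query embeddings
cached_responses: list[str] = []

def semantic_lookup(query: str) -> str | None:
    if not cached_queries:
        return None
    vec = encoder.encode(query, normalize_embeddings=True)
    sims = np.stack(cached_queries) @ vec   # cosine similarity on unit vectors
    best = int(np.argmax(sims))
    return cached_responses[best] if sims[best] >= THRESHOLD else None

def semantic_store(query: str, response: str) -> None:
    cached_queries.append(encoder.encode(query, normalize_embeddings=True))
    cached_responses.append(response)
```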
Prompt caching (provider-level)
Anthropic and OpenAI both offer prompt caching now. If your system prompt is the same across calls (and it should be), Anthropic charges 90% less for cached input tokens and OpenAI 50% less. For a 2K-token system prompt across 25K daily calls, that's meaningful — roughly $700/month saved just from system prompt caching on Sonnet.
Enable it. It's almost free money.
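On the Anthropic side, enabling it amounts to marking the static part of the prompt as cacheable. A rough sketch (check the current API docs for exact fields; the system prompt and model name here are placeholders):

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "..."            # your real ~2K-token system prompt, identical on every call

def ask(user_input: str):
    return client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            # marks the static prefix as cacheable; cache reads cost ~10% of normal input
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": user_input}],
    )
```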
Strategy 4: Prompt Engineering for Cost
Every token costs money. Treat your prompts like code — review them, optimise them, measure them.
What I changed:
Trimmed system prompts. My original system prompts were 3,000+ tokens of "be helpful, be accurate, consider edge cases..." I cut them to ~800 tokens with no quality difference. The models already know how to be helpful.
Stopped sending full conversation history. Instead of 20 messages of context, I send a summary of the conversation plus the last 3 messages. For a chatbot doing 10+ turns, this cuts input tokens by 60%.
Structured output requests. Instead of asking the model to explain its reasoning and then give an answer, I ask for JSON output directly. Shorter outputs = lower cost.
Removed redundant instructions. "Please respond in English" when the input is in English. "Be concise" followed by "provide a detailed explanation." Audit your prompts for contradictions and waste.
```python
# Before: ~3200 input tokens
system = """You are a helpful AI assistant. You should always be accurate
and provide detailed, well-structured responses. Consider edge cases.
Be polite. Format your response clearly. If you're unsure, say so.
Always respond in the same language as the user. Consider the context
of the conversation carefully before responding..."""

# After: ~600 input tokens
system = """Extract entities from user text. Return JSON:
{"entities": [{"text": str, "type": str, "confidence": float}]}
No explanation needed."""
```
Same task, 80% fewer tokens in, 90% fewer tokens out.
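On the conversation-history point, the trimming itself is trivial; the only extra LLM call is an occasional summary refresh, which can go to whatever cheap model you route such tasks to. A minimal sketch (the summarise callable is a placeholder):

```python
LAST_N = 3   # recent messages to keep verbatim

def build_context(history: list[dict], summarise) -> list[dict]:
    """Replace old turns with a rolling summary plus the last few messages."""
    if len(history) <= LAST_N:
        return history
    older, recent = history[:-LAST_N], history[-LAST_N:]
    summary = summarise(older)   # one cheap-model call, refreshed as the chat grows
    return [{"role": "system", "content": f"Conversation so far: {summary}"}] + recent
```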
Strategy 5: Local Models — When They're Good Enough
I run Qwen 2.5 7B on an RTX 3060 via Ollama. It costs nothing per request and handles more than you'd think.
Where local models work well (7B–14B range):
- Text classification: ~92% accuracy vs ~97% for Opus (good enough for routing)
- Entity extraction: ~89% accuracy on my benchmark set
- Reformatting/templating: essentially identical to cloud models
- Simple Q&A over provided context: solid when context is clean
Where they fall apart:
- Multi-step reasoning over large contexts
- Nuanced creative writing
- Complex code generation (fine for simple scripts)
- Anything requiring broad world knowledge
The key insight: for many production tasks, 92% accuracy is fine. If you're classifying support tickets or extracting dates from emails, you don't need Claude Opus. You need something fast, cheap, and good enough.
Running local also gives you zero-latency calls (no network round-trip), full data privacy, and no rate limits. For high-throughput pipelines, this matters as much as cost.
Practical setup with Ollama:
```bash
# Install Ollama (https://ollama.com), then pull the model
ollama pull qwen2.5:7b
# The Ollama server listens on :11434 and exposes an OpenAI-compatible /v1 API
# Point LiteLLM at it (api_base: http://localhost:11434) — done
```
For higher throughput, look at vLLM — it handles concurrent requests with continuous batching far better than Ollama.
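If you do outgrow Ollama, recent vLLM versions ship an OpenAI-compatible server as a one-liner (model name shown as an example; it listens on port 8000 by default):

```bash
pip install vllm
# Serves an OpenAI-compatible API on :8000 by default
vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 8192
```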
Real-World Results: The Multi-Agent System
Here's the before/after on my system — a multi-agent setup handling document processing, customer queries, and internal tooling:
Before (everything on Claude Opus):
- ~25,000 calls/day
- Average 6K tokens per call (in+out)
- ~$2,400/day → ~$72,000/month
After (routed + cached + optimised):
- 60% routed to GPT-4o mini or local models: -$1,400/day
- 25% routed to Sonnet instead of Opus: -$200/day
- Caching (exact + semantic) eliminated ~12% of calls: -$100/day
- Prompt optimisation reduced average tokens by ~35%, with the savings spread across all tiers
Result: ~$700/day → ~$21,000/month
That's a 71% reduction. The quality metrics I track (task completion rate, user satisfaction scores, accuracy on a held-out test set) showed less than 2% degradation overall. The 15% of tasks still hitting Opus actually got better because I could afford to give them more context and retries.
The 80/20 Rule
If you take one thing from this post: audit your model usage before optimising anything else.
In almost every system I've seen, 80% of LLM calls are simple tasks running on expensive models because that's what was easiest to set up during development. Nobody goes back to optimise the model choice because it works fine — until the bill arrives.
Start here:
- Log every LLM call with model, token count, task type, and cost (see the sketch below)
- Classify tasks by actual complexity needed
- Route the simple stuff to cheap/local models
- Enable prompt caching (literally a config flag)
- Trim your prompts — most are 2-3x longer than needed
Steps 1-3 will get you 50-60% of the savings. The rest is optimisation on top.
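Step 1 is the one people skip, and it's the foundation for everything else. A minimal logging wrapper might look like the sketch below; the price table and the print() call are placeholders for whatever pricing data and metrics backend you actually use:

```python
import time
import litellm

# Illustrative prices (USD per 1M input/output tokens); keep in sync with your providers
PRICES = {
    "claude-opus-4-20250514": (15.00, 75.00),
    "claude-sonnet-4-20250514": (3.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def logged_completion(model: str, messages: list[dict], task_type: str, **kwargs):
    start = time.time()
    response = litellm.completion(model=model, messages=messages, **kwargs)
    usage = response.usage
    in_price, out_price = PRICES.get(model, (0.0, 0.0))
    cost = (usage.prompt_tokens / 1e6) * in_price + (usage.completion_tokens / 1e6) * out_price
    # Swap print() for your metrics pipeline (a DB table, Prometheus, etc.)
    print({
        "model": model,
        "task_type": task_type,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "cost_usd": round(cost, 5),
        "latency_s": round(time.time() - start, 2),
    })
    return response
```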
Tools Worth Knowing
- LiteLLM: Unified API gateway, model routing, spend tracking, fallbacks. The single most useful tool for multi-model setups.
- Ollama: Dead-simple local model serving. Pull a model, run it, done.
- vLLM: Production-grade local inference with proper batching. Use when Ollama isn't enough.
- OpenRouter: Single API for 100+ models with automatic fallbacks and cost comparison.
Running LLMs in production doesn't have to mean choosing between quality and cost. It means being intentional about which model handles which task — the same way you wouldn't use a GPU instance to serve static files.
The expensive model should be your scalpel, not your hammer.
Got questions or want to share your own cost optimisation stories? Drop a comment — I'd love to hear what's worked for you.
Top comments (4)
Solid breakdown. The model routing strategy resonates a lot — I'm running a similar setup for a financial data pipeline (processing SEC 13F filings) and found nearly identical distribution: ~60% of tasks are simple extraction/classification that a small model handles perfectly.
One thing I'd add: semantic caching gets even more powerful when you combine it with structured output schemas. If you know the output format upfront (like your entity extraction example), you can cache at the schema level and get much higher hit rates than raw prompt similarity.
The LiteLLM recommendation is spot on. We switched to it about a month ago and the unified spend tracking alone was worth it — being able to see cost-per-task-type in a dashboard changed how we think about prompt design.
Curious about your experience with the 0.95 similarity threshold — have you tried adaptive thresholds based on task type? For classification tasks I've found 0.92 works fine, while for generation tasks you need 0.97+.
Thanks Vic - the schema-level caching idea is genuinely clever. We're doing semantic similarity on raw prompts but caching at the output schema level would definitely boost hit rates for structured extraction tasks. Going to experiment with that.
On adaptive thresholds - I haven't tried per-task-type tuning yet, but your numbers make intuitive sense. Classification is inherently more forgiving (there's only so many ways to say "this is a billing query"), while generation needs tighter matching or you get uncanny near-misses. I'll run some A/B tests with 0.92/0.97 splits and see what the cache hit rates look like.
Cheers for the SEC use case validation - always good to see the distribution holding up across different domains.
Really glad the schema-level caching idea resonated! We actually stumbled onto it by accident — our SEC filing parser was generating nearly identical structured outputs for similar document types, and caching at the raw prompt level was missing all those hits.
The 0.92/0.97 split has been working well for us in production. One thing I'd add: we also found that maintaining a small "golden set" of cached outputs per task type helps catch regression when you tune thresholds. Basically a handful of known-good input/output pairs that you validate against whenever you adjust the similarity cutoff.
Would love to hear how your A/B tests turn out — especially curious if the hit rate improvement holds up at scale with more diverse prompt distributions.
Great insights on reducing costs in production AI systems! Your routing strategy is crucial; many teams overlook how fine-tuning model selection can drastically cut expenses. Have you considered integrating a similar market intelligence approach into your model routing? This could involve adjusting the routing weights dynamically based on real-time spend analysis or fluctuations in model performance. I’d love to see how you balance between model accuracy and cost efficiency in long-term use cases! 🔍