If you use OpenClaw, Codex, Cursor, or Claude Code, you've probably seen the same thing: your premium model quota disappears halfway through the week.
I finally looked at my own usage logs and the cause was obvious:
Most prompts weren't "hard." They were boring utility work:
- "Summarize this paragraph"
- "Reformat this JSON"
- "Explain this error"
- "Convert this to TypeScript"
Those prompts don't need Opus or GPT-4 class models. But they were getting sent there anyway, burning quota like it's free.
What I found in my data
After a week of logs, the distribution was consistent:
- ~55% simple: summarize / format / translate / explain
- ~25% medium: write a function / debug
- ~15% complex: refactor / architecture
- ~5% reasoning-heavy: tradeoffs / proof-style reasoning
So the real problem wasn't "I need more quota."
It was "my tools don't know when not to use the premium model."
NadirClaw: open-source LLM routing as a local proxy
I built NadirClaw, an OpenAI-compatible proxy that sits between your AI tool and your model providers.
It classifies each request in ~10ms using sentence embeddings and routes it automatically (a minimal sketch of the idea follows the tier list):
- Simple → Gemini Flash or Ollama (free/cheap)
- Complex → Sonnet (or your preferred premium)
- Reasoning-heavy → Opus / GPT-class reasoning models
- Agentic prompts → auto-detected and forced to premium (cheap models fall apart here)
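Under the hood this is essentially nearest-prototype classification over sentence embeddings. The snippet below is a minimal sketch of that idea, not NadirClaw's actual code: it assumes the sentence-transformers library, and the tier names, example prompts, and model IDs are illustrative placeholders.

```python
# Sketch of embedding-based tier routing (illustrative; not NadirClaw's implementation).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast embedding model

# A few example prompts per tier; a real router would use many more.
TIER_EXAMPLES = {
    "simple":    ["Summarize this paragraph", "Reformat this JSON", "Explain this error"],
    "complex":   ["Refactor this module into smaller services", "Design the caching layer"],
    "reasoning": ["Compare these two approaches and justify the tradeoffs"],
}
TIER_MODELS = {  # hypothetical model names per tier
    "simple": "gemini-flash",
    "complex": "claude-sonnet",
    "reasoning": "claude-opus",
}

# Pre-compute prototype embeddings once at startup so each request only needs
# one encode plus a cosine-similarity lookup.
PROTOTYPES = {tier: encoder.encode(examples, convert_to_tensor=True)
              for tier, examples in TIER_EXAMPLES.items()}

def route(prompt: str) -> str:
    """Pick the model for the tier whose example prompts are closest to the incoming prompt."""
    query = encoder.encode(prompt, convert_to_tensor=True)
    best_tier, best_score = "reasoning", -1.0  # when nothing matches well, fall back to premium
    for tier, protos in PROTOTYPES.items():
        score = util.cos_sim(query, protos).max().item()  # best cosine similarity in this tier
        if score > best_score:
            best_tier, best_score = tier, score
    return TIER_MODELS[best_tier]

print(route("Convert this function to TypeScript"))  # e.g. "gemini-flash" if it lands in the simple tier
```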
The features that matter
- Agentic detection: tool-use + loop-style prompts route to premium automatically
- Session pinning: multi-turn chats stay on one model (no context breakage)
- 429 fallback: rate-limit hits fail over to the next provider instead of breaking your flow (a minimal sketch follows this list)
- Routing profiles: auto, eco, premium, free, reasoning
- Reporting: breakdown of tiers, latency, and token usage from JSONL logs
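The 429 fallback is basically an ordered provider chain where a rate-limited call moves on to the next entry. Here's a minimal sketch of that pattern using the openai SDK against OpenAI-compatible endpoints; the provider URLs, keys, and model names are placeholders, and this is not NadirClaw's internal code.

```python
# Sketch of 429 failover across an ordered provider chain (placeholders throughout).
from openai import OpenAI, RateLimitError

# Try providers in order; move on when one rate-limits.
PROVIDERS = [
    {"base_url": "https://api.provider-a.example/v1", "api_key": "KEY_A", "model": "cheap-model"},
    {"base_url": "https://api.provider-b.example/v1", "api_key": "KEY_B", "model": "backup-model"},
]

def chat_with_fallback(messages: list[dict]) -> str:
    last_error = None
    for p in PROVIDERS:
        client = OpenAI(base_url=p["base_url"], api_key=p["api_key"])
        try:
            resp = client.chat.completions.create(model=p["model"], messages=messages)
            return resp.choices[0].message.content
        except RateLimitError as err:  # HTTP 429: don't break the flow, try the next provider
            last_error = err
    raise RuntimeError("every provider in the chain is rate-limited") from last_error

print(chat_with_fallback([{"role": "user", "content": "Explain this error: KeyError: 'id'"}]))
```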
Results (after 2 weeks)
- Premium quota lasted the full week (instead of dying by Wednesday)
- ~60% of prompts moved to free/cheap models with no noticeable quality drop
- Added overhead stayed ~8-12ms per request
Works with your existing tools
Because it speaks the OpenAI Chat Completions API, it works with:
OpenClaw, Codex, Cursor, Claude Code, Continue, and most OpenAI-compatible clients.
Providers supported: Gemini (native), OpenAI, Anthropic, Ollama, and anything LiteLLM supports.
Quick start
```bash
pip install nadirclaw
nadirclaw serve --verbose
```
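Once the proxy is running, any OpenAI-compatible client can be pointed at it. Here's a quick sanity check with the openai Python SDK; the port and the "auto" model alias below are assumptions on my part, so check the README for the actual defaults.

```python
# Pointing a standard OpenAI client at the local proxy (port and model alias are assumed).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local NadirClaw proxy; port is a guess
    api_key="unused-locally",             # placeholder; the proxy holds the real provider keys
)

resp = client.chat.completions.create(
    model="auto",  # assumed routing alias: let the proxy pick the tier
    messages=[{"role": "user", "content": "Summarize this paragraph: ..."}],
)
print(resp.choices[0].message.content)
```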
GitHub: https://github.com/doramirdor/NadirClaw
MIT licensed. Actively maintained.
If you're drowning in quota burn, this is the fix: route the easy stuff away from premium models automatically.
Top comments (1)
This is exactly the problem we ran into when building multi-agent pipelines. The naive approach — one model for everything — burns through quota fast and adds unnecessary latency for trivial tasks.
A few things we learned the hard way:
Classification latency matters a lot at scale. 10ms is solid, but once you're routing thousands of requests per hour, even your classifier becomes a bottleneck if it's not async. We ended up batching low-priority classification jobs.
Failure modes between providers are different, not just in rate limits but in how they fail. Anthropic tends to return 529s under load while OpenAI is more likely to quietly degrade quality. Took us a while to build detection for the second case.
The simple-heavy distribution you found matches what we see too. The real unlock is building confidence thresholds: if the router isn't sure, default to premium. It's a false economy to save quota and get garbage output on something borderline.
Have you looked at adding feedback loops? If a "lite" routed request comes back with low quality markers (short output, high refusal rate, user correction patterns), automatically escalating those task types over time would make the classification smarter without manual tuning.