In the rapidly evolving world of Large Language Models (LLMs), developers often face a dilemma: lock into a single provider like OpenAI, or juggle several incompatible APIs (Anthropic, Mistral, local LLMs).
Today, I built a Minimal Unified Model Gateway in Python that solves this by providing a single, OpenAI-compatible endpoint that intelligently routes traffic to the best model for the job—whether it's running in the cloud or locally on your machine via Ollama.
🏗️ The Architecture
The system is built on FastAPI for high performance and uses Redis (optional) for caching. Here's how a request flows through the system:
1. The Core: OpenAI Compatibility
I chose to mimic the OpenAI API standard (`/v1/chat/completions`). This means you can use the official OpenAI Python/Node.js SDKs, LangChain, or any other tool that supports OpenAI, simply by changing the `base_url`.
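For example, with the official Python SDK, switching to the gateway is a one-line change. This is a minimal sketch assuming the gateway listens on `localhost:8000`; the port, API key, and model name are illustrative, not values taken from the repo.

```python
from openai import OpenAI

# Point the official SDK at the gateway instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed gateway address
    api_key="not-needed-locally",         # local gateways often ignore the key
)

response = client.chat.completions.create(
    model="gemma3:4b",  # any model the gateway knows how to route
    messages=[{"role": "user", "content": "Explain the adapter pattern in one sentence."}],
)
print(response.choices[0].message.content)
```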
2. The Smart Router
The heart of the gateway is router.py. It doesn't just pass requests blindly; it inspects them.
- **Intent Detection:** If your prompt contains code keywords (`def`, `class`, `import`), it routes to a specialized coding model (e.g., `qwen3:4b`).
- **Optimization:** Short, simple prompts are routed to a smaller, faster model (e.g., `gemma3:4b`), saving compute and reducing latency.
- **Reliability:** If your primary local model is overloaded or fails, the gateway automatically falls back to a backup model or even a cloud provider, as sketched below.
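Put together, the routing rules look something like this sketch. The keyword list, length threshold, and fallback order are illustrative assumptions, not the actual values in `router.py`:

```python
CODE_KEYWORDS = ("def ", "class ", "import ")

def pick_models(prompt: str) -> list[str]:
    """Return candidate models in priority order: primary first, fallbacks after."""
    if any(kw in prompt for kw in CODE_KEYWORDS):
        return ["qwen3:4b", "gpt-4"]      # coding intent: specialist, then cloud fallback
    if len(prompt) < 200:
        return ["gemma3:4b", "qwen3:4b"]  # short prompt: small, fast model first
    return ["qwen3:4b", "gpt-4"]          # default route

def generate_with_fallback(prompt: str, adapters: dict) -> str:
    """Try each candidate until one succeeds; raise only if all fail."""
    last_error = None
    for model in pick_models(prompt):
        try:
            return adapters[model].generate(prompt)  # first healthy model wins
        except Exception as exc:  # connection refused, overload, timeout
            last_error = exc
    raise RuntimeError("All candidate models failed") from last_error
```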
3. The Adapter Pattern
To support multiple providers without spaghetti code, I implemented an Adapter Pattern.
- `ModelAdapter` (Abstract Base Class): Defines the contract (the `generate` method).
- `OllamaAdapter`: Handles communication with local Ollama instances and translates formatting quirks.
- `OpenAIAdapter`: Connects to any OpenAI-compatible API (GPT-4, vLLM, Groq).

Adding a new provider (like Anthropic) is as simple as adding a new class file.
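Here's a rough sketch of that contract. Only the class names and the `generate` method come from the description above; the constructor arguments and payload handling are assumptions for illustration.

```python
from abc import ABC, abstractmethod

import httpx

class ModelAdapter(ABC):
    """Contract every provider adapter must implement."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the model's reply text for a prompt."""

class OllamaAdapter(ModelAdapter):
    def __init__(self, model: str, base_url: str = "http://localhost:11434"):
        self.model, self.base_url = model, base_url

    def generate(self, prompt: str) -> str:
        # Ollama's native chat endpoint; non-streaming for simplicity.
        resp = httpx.post(
            f"{self.base_url}/api/chat",
            json={
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
                "stream": False,
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["message"]["content"]

class OpenAIAdapter(ModelAdapter):
    def __init__(self, model: str, base_url: str, api_key: str):
        self.model, self.base_url, self.api_key = model, base_url, api_key

    def generate(self, prompt: str) -> str:
        # Works for OpenAI itself or any compatible server (vLLM, Groq).
        resp = httpx.post(
            f"{self.base_url}/chat/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```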
4. Observability & Caching
- **SQLite Logging:** Every request, token usage, and latency metric is logged to a local SQLite database, giving you full visibility into your AI traffic.
- **Redis Caching:** Identical requests are cached (based on a hash of the prompt and parameters), providing instant responses and saving costs. A combined sketch of both follows.
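Both pieces fit in a thin wrapper around an adapter call. The table schema, cache TTL, and token accounting here are illustrative assumptions:

```python
import hashlib
import json
import sqlite3
import time

import redis

cache = redis.Redis()  # assumes a local Redis on the default port
db = sqlite3.connect("gateway_logs.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS requests
       (ts REAL, model TEXT, approx_tokens INTEGER, latency_ms REAL)"""
)

def cache_key(model: str, prompt: str, params: dict) -> str:
    # Hash the whole request so only identical prompt + parameters collide.
    payload = json.dumps({"model": model, "prompt": prompt, **params}, sort_keys=True)
    return "gw:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_generate(adapter, model: str, prompt: str, **params) -> str:
    key = cache_key(model, prompt, params)
    if (hit := cache.get(key)) is not None:
        return hit.decode()  # cache hit: instant response, zero model cost
    start = time.monotonic()
    reply = adapter.generate(prompt)
    latency_ms = (time.monotonic() - start) * 1000
    cache.setex(key, 3600, reply)  # keep the reply for an hour
    db.execute(
        "INSERT INTO requests VALUES (?, ?, ?, ?)",
        (time.time(), model, len(reply.split()), latency_ms),  # word count as a rough token proxy
    )
    db.commit()
    return reply
```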
🚀 Why This Matters
For developers, this means freedom. You can start building with a local Llama 3 model for free, and swap it out for GPT-4 in production without changing a single line of your application code. You can A/B test models, failover gracefully, and optimize costs dynamically.
This gateway represents a "Local First" approach to AI development—powerful, private, and independent.
🔗 Code & Usage
Check out the full implementation on GitHub: https://github.com/harishkotra/unified-gateway
