In the rapidly evolving world of Large Language Models (LLMs), developers often face a dilemma: lock into a single provider like OpenAI, or juggle several incompatible APIs (Anthropic, Mistral, local LLMs).
Today, I built a Minimal Unified Model Gateway in Python that solves this by providing a single, OpenAI-compatible endpoint that intelligently routes traffic to the best model for the job—whether it's running in the cloud or locally on your machine via Ollama.
🏗️ The Architecture
The system is built on FastAPI for high performance and uses Redis (optional) for caching. Here's how a request flows through the system:
1. The Core: OpenAI Compatibility
I chose to mimic the OpenAI API standard (`/v1/chat/completions`). This means you can use the official OpenAI Python/Node.js SDKs, LangChain, or any other tool that supports OpenAI, simply by changing the `base_url`.
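For example, with the official Python SDK, switching to the gateway is a one-line change. This is a minimal sketch assuming the gateway listens on `localhost:8000`; the port, API key, and model name are illustrative, not values taken from the repo.

```python
from openai import OpenAI

# Point the official SDK at the gateway instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed gateway address
    api_key="not-needed-locally",         # local gateways often ignore the key
)

response = client.chat.completions.create(
    model="gemma3:4b",  # any model the gateway knows how to route
    messages=[{"role": "user", "content": "Explain the adapter pattern in one sentence."}],
)
print(response.choices[0].message.content)
```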
2. The Smart Router
The heart of the gateway is router.py. It doesn't just pass requests blindly; it inspects them.
- **Intent Detection:** If your prompt contains code keywords (`def`, `class`, `import`), it routes to a specialized coding model (e.g., `qwen3:4b`).
- **Optimization:** Short, simple prompts are routed to a smaller, faster model (e.g., `gemma3:4b`), saving compute and reducing latency.
- **Reliability:** If your primary local model is overloaded or fails, the gateway automatically falls back to a backup model or even a cloud provider, as sketched below.
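Put together, the routing rules look something like this sketch. The keyword list, length threshold, and fallback order are illustrative assumptions, not the actual values in `router.py`:

```python
CODE_KEYWORDS = ("def ", "class ", "import ")

def pick_models(prompt: str) -> list[str]:
    """Return candidate models in priority order: primary first, fallbacks after."""
    if any(kw in prompt for kw in CODE_KEYWORDS):
        return ["qwen3:4b", "gpt-4"]      # coding intent: specialist, then cloud fallback
    if len(prompt) < 200:
        return ["gemma3:4b", "qwen3:4b"]  # short prompt: small, fast model first
    return ["qwen3:4b", "gpt-4"]          # default route

def generate_with_fallback(prompt: str, adapters: dict) -> str:
    """Try each candidate until one succeeds; raise only if all fail."""
    last_error = None
    for model in pick_models(prompt):
        try:
            return adapters[model].generate(prompt)  # first healthy model wins
        except Exception as exc:  # connection refused, overload, timeout
            last_error = exc
    raise RuntimeError("All candidate models failed") from last_error
```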
3. The Adapter Pattern
To support multiple providers without spaghetti code, I implemented an Adapter Pattern.
- `ModelAdapter` (Abstract Base Class): Defines the contract (the `generate` method).
- `OllamaAdapter`: Handles communication with local Ollama instances and translates formatting quirks.
- `OpenAIAdapter`: Connects to any OpenAI-compatible API (GPT-4, vLLM, Groq).

Adding a new provider (like Anthropic) is as simple as adding a new class file.
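Here's a rough sketch of that contract. Only the class names and the `generate` method come from the description above; the constructor arguments and payload handling are assumptions for illustration.

```python
from abc import ABC, abstractmethod

import httpx

class ModelAdapter(ABC):
    """Contract every provider adapter must implement."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the model's reply text for a prompt."""

class OllamaAdapter(ModelAdapter):
    def __init__(self, model: str, base_url: str = "http://localhost:11434"):
        self.model, self.base_url = model, base_url

    def generate(self, prompt: str) -> str:
        # Ollama's native chat endpoint; non-streaming for simplicity.
        resp = httpx.post(
            f"{self.base_url}/api/chat",
            json={
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
                "stream": False,
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["message"]["content"]

class OpenAIAdapter(ModelAdapter):
    def __init__(self, model: str, base_url: str, api_key: str):
        self.model, self.base_url, self.api_key = model, base_url, api_key

    def generate(self, prompt: str) -> str:
        # Works for OpenAI itself or any compatible server (vLLM, Groq).
        resp = httpx.post(
            f"{self.base_url}/chat/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```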
4. Observability & Caching
- **SQLite Logging:** Every request, token usage, and latency metric is logged to a local SQLite database, giving you full visibility into your AI traffic.
- **Redis Caching:** Identical requests are cached (based on a hash of the prompt and parameters), providing instant responses and saving costs. A combined sketch of both follows.
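Both pieces fit in a thin wrapper around an adapter call. The table schema, cache TTL, and token accounting here are illustrative assumptions:

```python
import hashlib
import json
import sqlite3
import time

import redis

cache = redis.Redis()  # assumes a local Redis on the default port
db = sqlite3.connect("gateway_logs.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS requests
       (ts REAL, model TEXT, approx_tokens INTEGER, latency_ms REAL)"""
)

def cache_key(model: str, prompt: str, params: dict) -> str:
    # Hash the whole request so only identical prompt + parameters collide.
    payload = json.dumps({"model": model, "prompt": prompt, **params}, sort_keys=True)
    return "gw:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_generate(adapter, model: str, prompt: str, **params) -> str:
    key = cache_key(model, prompt, params)
    if (hit := cache.get(key)) is not None:
        return hit.decode()  # cache hit: instant response, zero model cost
    start = time.monotonic()
    reply = adapter.generate(prompt)
    latency_ms = (time.monotonic() - start) * 1000
    cache.setex(key, 3600, reply)  # keep the reply for an hour
    db.execute(
        "INSERT INTO requests VALUES (?, ?, ?, ?)",
        (time.time(), model, len(reply.split()), latency_ms),  # word count as a rough token proxy
    )
    db.commit()
    return reply
```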
🚀 Why This Matters
For developers, this means freedom. You can start building with a local Llama 3 model for free, and swap it out for GPT-4 in production without changing a single line of your application code. You can A/B test models, failover gracefully, and optimize costs dynamically.
This gateway represents a "Local First" approach to AI development—powerful, private, and independent.
🔗 Code & Usage
Check out the full implementation on GitHub: https://github.com/harishkotra/unified-gateway
