Harish Kotra (he/him)
Building a Unified AI Gateway: "Ollama First" Architecture

In the rapidly evolving world of Large Language Models (LLMs), developers often face a dilemma: lock into a single provider like OpenAI, or juggle multiple incompatible APIs (Anthropic, Mistral, local LLMs).

Today, I built a Minimal Unified Model Gateway in Python that solves this by providing a single, OpenAI-compatible endpoint that intelligently routes traffic to the best model for the job—whether it's running in the cloud or locally on your machine via Ollama.

🏗️ The Architecture

The system is built on FastAPI for high performance and uses Redis (optional) for caching. Here's how a request flows through the system:

(Architecture diagram: request flow through the gateway)

1. The Core: OpenAI Compatibility

I chose to mimic the OpenAI API standard (/v1/chat/completions). This means you can use the official OpenAI Python/Node.js SDKs, LangChain, or any other tool that supports OpenAI, simply by changing the base_url.
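
For example, with the official OpenAI Python SDK the only change is the base_url. The port, API key, and model name below are assumptions; point base_url at wherever your gateway is running:

```python
# Minimal usage sketch — the gateway URL, API key, and model name are
# assumptions; adjust them to match your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # the gateway, not api.openai.com
    api_key="not-needed-locally",         # local gateways typically ignore this
)

response = client.chat.completions.create(
    model="gemma3:4b",  # any model name the gateway's router recognizes
    messages=[{"role": "user", "content": "Summarize the adapter pattern."}],
)
print(response.choices[0].message.content)
```

Because the endpoint mimics /v1/chat/completions, LangChain and any other OpenAI-compatible tooling work the same way.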

2. The Smart Router

The heart of the gateway is router.py. It doesn't just pass requests blindly; it inspects them (see the sketch after this list).

  • Intent Detection: If your prompt contains code keywords (def, class, import), it routes to a specialized coding model (e.g., qwen3:4b).
  • Optimization: Short, simple prompts are routed to a smaller, faster model (e.g., gemma3:4b), saving compute and reducing latency.
  • Reliability: If your primary local model is overloaded or fails, the gateway automatically falls back to a backup model or even a cloud provider.
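
Conceptually, the routing logic boils down to something like the following. This is a simplified sketch, not the actual router.py; the keyword list, the word-count threshold, and the default and fallback models are illustrative assumptions:

```python
# Simplified sketch of the routing idea — keyword list, threshold, and
# model names are illustrative, not the gateway's exact configuration.
CODE_KEYWORDS = ("def ", "class ", "import ")

def pick_model(prompt: str) -> str:
    """Choose a model by inspecting the prompt."""
    if any(kw in prompt for kw in CODE_KEYWORDS):
        return "qwen3:4b"    # code-heavy prompt -> specialized coding model
    if len(prompt.split()) < 20:
        return "gemma3:4b"   # short, simple prompt -> small, fast model
    return "llama3:8b"       # everything else -> general-purpose default

# If the chosen model is overloaded or errors out, walk a fallback chain
# that ends at a cloud provider.
FALLBACK_CHAIN = {"qwen3:4b": "gemma3:4b", "gemma3:4b": "gpt-4"}
```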

3. The Adapter Pattern

To support multiple providers without spaghetti code, I implemented the Adapter Pattern (sketched below).

  • ModelAdapter (Abstract Base Class): Defines the contract (generate method).
  • OllamaAdapter: Handles communication with local Ollama instances and translates formatting quirks.
  • OpenAIAdapter: Connects to any OpenAI-compatible API (GPT-4, vLLM, Groq).

Adding a new provider (like Anthropic) is as simple as adding a new adapter class.
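
In code, the contract might look like this. The sketch assumes Ollama's /api/chat endpoint and httpx for async HTTP; the real adapters handle more (streaming, formatting quirks) than this minimal version:

```python
# Sketch of the adapter contract — Ollama's /api/chat endpoint and the
# httpx dependency are assumptions about the implementation.
from abc import ABC, abstractmethod

import httpx

class ModelAdapter(ABC):
    @abstractmethod
    async def generate(self, model: str, messages: list[dict]) -> str:
        """Return the assistant's reply for the given chat messages."""

class OllamaAdapter(ModelAdapter):
    def __init__(self, base_url: str = "http://localhost:11434"):
        self.base_url = base_url

    async def generate(self, model: str, messages: list[dict]) -> str:
        async with httpx.AsyncClient(timeout=120) as client:
            resp = await client.post(
                f"{self.base_url}/api/chat",
                json={"model": model, "messages": messages, "stream": False},
            )
            resp.raise_for_status()
            # Ollama returns {"message": {"role": ..., "content": ...}, ...}
            return resp.json()["message"]["content"]
```

Because every adapter satisfies the same generate() contract, the router never needs to know which backend it's talking to.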

4. Observability & Caching

  • SQLite Logging: Every request, token usage, and latency metric is logged to a local SQLite database, giving you full visibility into your AI traffic.
  • Redis Caching: Identical requests are cached (based on a hash of the prompt and parameters), providing instant responses and saving costs (see the sketch below).
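
To make the caching concrete, here's a minimal sketch of the key-hashing idea, assuming redis-py; the key prefix and one-hour TTL are assumptions, not the gateway's actual settings:

```python
# Sketch of the cache-key idea — key prefix, TTL, and Redis connection
# settings are assumptions.
import hashlib
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_key(model: str, messages: list[dict], temperature: float) -> str:
    # Canonical JSON (sort_keys=True) so identical requests hash identically
    # regardless of dict ordering.
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
    )
    return "gateway:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_or_generate(key: str, generate_fn):
    if (hit := r.get(key)) is not None:
        return hit                  # cache hit: instant response, zero compute
    result = generate_fn()
    r.setex(key, 3600, result)      # cache for an hour
    return result
```

Hashing the full request (model, messages, and sampling parameters) rather than just the prompt ensures two requests only share a cache entry when they would genuinely produce the same call.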

🚀 Why This Matters

For developers, this means freedom. You can start building with a local Llama 3 model for free, and swap it out for GPT-4 in production without changing a single line of your application code. You can A/B test models, failover gracefully, and optimize costs dynamically.

This gateway represents a "Local First" approach to AI development—powerful, private, and independent.

🔗 Code & Usage

Check out the full implementation on GitHub: https://github.com/harishkotra/unified-gateway
