DEV Community

Cover image for AI gateways: why and how
Nicolas Fränkel
Nicolas Fränkel

Posted on • Originally published at blog.frankel.ch

AI gateways: why and how

Governance and provider independence

Before working for 2 years on the Apache APISIX API gateway, I was mainly oblivious to API gateways. It's only by working with them that I understood their value. Decoupling the client and the server unlocks a lot of options: moving authentication to the API Gateway, securing APIs, deduplicating API requests, etc.

In this post, I want to describe how the same pattern applies to AI.

AI gateways

AI gateways work in a similar way. They proxy requests and responses between an AI client and its LLM backend(s). Likewise, it unlocks many improvements:

  • AI management: single glass pane to address multiple AI models and providers, simplifying the complexity of integrating and switching between different services.
  • Compliance and governance: enforce security, data privacy, and compliance in a single place.
  • Cost control: intelligent routing, semantic caching, and budget management.
  • Avoiding lock-in: clients depend on an abstraction under control; the gateway manages server API updates and even migrations to a new provider.
  • Scalability and reliability: supports automatic failover and load balancing.
  • Observability: need I say more?

My use-case

I love Claude Code; it's amazing, and my experience is more than positive. It has one problem, though: Anthropic is a US-based company, and subject to the Patriot Act. By design, Anthropic must comply with any US government request to access all of my data. As a non-US citizen, I have no right to object. I have no right to be told.

On the other hand, Mistral AI is a European Union-based company. I have heard pretty good stuff about it, especially devstral:

Today we introduce Devstral, our agentic LLM for software engineering tasks. Devstral is built under a collaboration between Mistral AI and All Hands AI 🙌, and outperforms all open-source models on SWE-Bench Verified by a large margin. We release Devstral under the Apache 2.0 license.

Devstral SWE

Devstral

It would be amazing if I could use Claude Code while routing the calls to devstral. AI gateways are a perfect fit.

Available AI gateways

In case you are in need of an AI gateway, please do your due diligence. Because it's exploratory work, I did a pretty shallow search, not deep research. These are the ones that surfaced.

  • LiteLLM:

LiteLLM is an open-source library that gives you a single, unified interface to call 100+ LLMs — OpenAI, Anthropic, Vertex AI, Bedrock, and more — using the OpenAI format.

  • Call any provider using the same completion() interface — no re-learning the API for each one
  • Consistent output format regardless of which provider or model you use
  • Built-in retry / fallback logic across multiple deployments via the Router
  • Self-hosted LLM Gateway (Proxy) with virtual keys, cost tracking, and an admin UI

Getting Started

  • Bifrost:

Bifrost is a high-performance AI gateway that unifies access to 20+ providers OpenAI, Anthropic, AWS Bedrock, Google Vertex, Azure, and more, through a unified API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade governance. In sustained benchmarks at 5,000 requests per second, Bifrost adds only 11 µs of overhead per request.

Overview

  • OpenRouter:

OpenRouter provides a unified API that gives you access to hundreds of AI models through a single endpoint, while automatically handling fallbacks and selecting the most cost-effective options. Get started with just a few lines of code using your preferred SDK or framework.

Quickstart

I chose Bifrost for several reasons. It's self-hosted, it's close to the metal (it's written in Go), and the quickstart is dead simple via npx. It is also available via a Docker image. It has a UI, which is great for exploring features. Finally, it has baked-in observability.

Bifrost 5-minute quick start

Starting Bifrost is a single command:

npx -y @maximhq/bifrost
Enter fullscreen mode Exit fullscreen mode

The menu's first item shows the telemetry dashboard. It's quite useful.

The first step is to create a model provider for Mistral. Go to Models > Model Providers. Then click on + Add New Provider. Finally, choose + Add new key and fill the fields accordingly. You should have previously generated a Mistral API key.

The second step is to create a routing rule so that Claude Code requests targeting any of Anthropic's models target Mistral instead. Go to Models > Routing Rules. Click on + New rule, then inside the opening menu, on + Add Target.

  • Provider: Mistral AI
  • Model: devstral-2512

Leave all other fields blank.

Before launching Claude Claude, export the following environment variables:

export ANTHROPIC_BASE_URL=http://localhost:8080/anthropic     # 1
export ANTHROPIC_AUTH_TOKEN=sk-dummy                          # 2
Enter fullscreen mode Exit fullscreen mode
  1. Use the local Bifrost instead of the default Anthropic URL
  2. Unused, but necessary

It should just be a matter of starting Claude Code and having fun. Unfortunately, it fails.

❯ hello
  ⎿  API Error: 422 {"type":"error","error":{"type":"api_error","message":"provider API error (status 422)"}}
Enter fullscreen mode Exit fullscreen mode

Debugging the setup

Failures are always great opportunities to understand a solution better. On the provider, click on Edit Provider Config. Set the switch of the Debugging tab.

Then, in Observability > LLM Logs, select the failing request. Find the section named Raw Response from Mistral. It looks something like this:

{
  "object": "error",
  "message": {
    "detail": [
      {
        "type": "enum",
        "loc": [
          "body",
          "reasoning_effort"
        ],
        "msg": "Input should be 'none' or 'high'",
        "input": "medium",
        "ctx": {
          "expected": "'none' or 'high'"                // 1
        }
      }
    ]
  },
  "type": "invalid_request_error",
  "param": null,
  "code": null,
  "raw_status_code": 422
}
Enter fullscreen mode Exit fullscreen mode
  1. There's a mismatch between the request sent by Claude Code and what Mistral expects.

At this point, I had pinpointed the issue. I opened a bug report. The feedback came back in less than an hour, and the fix attempt was in a couple of weeks. It still doesn't work, but I found a workaround.

export CLAUDE_CODE_DISABLE_THINKING=1
Enter fullscreen mode Exit fullscreen mode

Extended thinking is the reasoning Claude emits before responding. On models that support adaptive reasoning, the effort level is the primary control for how much thinking happens; the settings below turn thinking on or off and control how it displays.

Extended thinking

At this point, it works, but at the cost of a lack of higher reasoning.

Request log in Bifrost

Budget management

The above is good for a personal setup, but in enterprise settings, you'll rarely see this approach, if ever. What you regularly see, however, is a monthly spending limit. This situation has two issues:

  • Users must be able to track their usage. Without precise tracking, they run the risk of consuming too many tokens too early.
  • If you overspend, you are stuck until the counter resets, which generally happens at the start of the next month. Ask me how I know.

AI gateways in general, and Bifrost in particular, allow lots of options to handle both. I'll leave the metering aside for this post to focus on the usage itself. The AI gateway could do a couple of things to help manage the budget.

  • Set a daily cap. It's better to be capped for half an hour at the end of the day than for more than a day at the end of the month.
  • Redirect request using specific models to other models. You can use it to prevent users from using expensive models. At the moment of this writing, Claude Opus tokens cost at least 5 times more than Claude Sonnet's.
  • Degrade model when over-budget. Once users have gone over a specific budget, redirect requests to a less expensive model.
  • Redirect to self-hosted model when over-budget.

These are just a couple of the use cases that you could implement.

Simple fallback

In this section, I want to demo the last item. In an enterprise setting, you could host the fallback on a cloud instance or even self-host. In the context of this post, I'll limit myself to falling back to a local llama-server if above the set limit.

The first step is to define a spending limit. Go to the Mistral provider defined earlier. Set the limit in the Governance tab. Here's where I set the cap to 100,000 tokens daily for demo purposes.

Set a token limit for Mistral AI

The second step is to add a Llama Server provider. It's type custom.

  • API Structure: The base is OpenAI, because llama-server is compatible. Remember to limit the Allowed Request Types to what the model can actually do. For my model, it's Chat Completion and Chat Completion Stream.
  • Network: Set the Base URL to http://localhost:<port>. Because Bifrost binds on port 8080, I run llama-server on 8081.

The third and final step is to add a fallback to the existing routing rule.

Set a routing fallback

With the low number of tokens set, you'll hit the ceiling after one or two requests.

Error message when the token cap has been reached

On the next request, Bifrost will:

  1. Evaluate the condition
  2. Prevent the request from reaching Mistral
  3. And send it to the fallback Llama server instead.

Bifrost logs the two requests, the first one to Mistral AI and the second to Llama. Here's a sample:

Request logs

Conclusion

AI gateways are a fantastic architectural improvement. Bifrost looks great on paper, and the dashboard seems quite exhaustive.

Unfortunately, a Bifrost bug currently prevents the implementation of my extended thinking. I hope maintainers fix it fast and I can work further on AI gateways "for real".

However, when it works, I'll be able to leverage Claude Code's amazing client with the LLM of my choice, including Devstral.

To go further:


Originally published at A Java Geek on May 31st, 2026.

Top comments (17)

Collapse
 
francistrdev profile image
FrancisTRᴅᴇᴠ (っ◔◡◔)っ

Interesting read! Well written! It took me a while since I got distracted by your Cover Image on this post lol (is that hulk?)

No offense to the Cover Image by the way! Just something I notice :)

Again, well done :D

Collapse
 
xulingfeng profile image
xulingfeng

The governance angle is the part most people overlook with AI gateways — they focus on provider switching and cost optimization, not on whether the system can explain its decisions. Data sovereignty (Mistral vs Claude) is a very real concern too.

Curious about your testing approach: how do you verify that the gateway actually enforces the policies described, beyond just routing traffic correctly?

Collapse
 
nfrankel profile image
Nicolas Fränkel

how do you verify that the gateway actually enforces the policies described

How do you verify that any of your infrastructure component does its job?

Collapse
 
xulingfeng profile image
xulingfeng

Fair point, but one thing keeps bugging me.
Traditional middleware enforces config — you test it once, you trust it. AI gateways have policies that are classification problems (PII scanning, output guardrails), not config. Model output is stochastic, so classifications have false negatives.
How do you audit trust in a probabilistic decision?
Have you seen any gateway tackle this distinction?

Thread Thread
 
nfrankel profile image
Nicolas Fränkel

I'm afraid that we disagree on your assumptions.

Gateways are gateways: glorified proxies that enforce a single entry point. At this point, what feature you use is up to you. Observability, authorization, etc. are common features, in both API and AI gateways.

The "probabilistic nature" comes from the back-end part, not from the gateway part.

Thread Thread
 
xulingfeng profile image
xulingfeng

You're right that structurally, a gateway is a gateway — the proxy layer itself doesn't introduce probability. But the nature of the backend changes what the gateway needs to do. A traditional API gateway can assume deterministic responses and focus on routing, auth, rate limiting, and caching by exact key. An AI gateway, sitting in front of a probabilistic backend, also needs response validation, semantic caching, cost-per-token tracking, and content-based fallback. Not because the gateway became probabilistic — because the thing it's proxying is. Same architectural pattern, different operational surface area.

Collapse
 
itskondrat profile image
Mykola Kondratiuk

ran one of these in front of an agent fleet - rate limiting alone was worth it. policy enforcement was the bonus I did not expect to care about.

Collapse
 
nfrankel profile image
Nicolas Fränkel

Yes, gateways unlock so many benefits you don't expect.

You barely understand how you could work without them before.

Collapse
 
itskondrat profile image
Mykola Kondratiuk

cost attribution per agent was the one i did not see coming. zero visibility into which worker was burning budget until the gateway showed up. now it is the first thing i check on a long run.

Collapse
 
merbayerp profile image
Mustafa ERBAY

I think AI gateways will eventually become a standard layer in enterprise AI architectures.
Not because organizations need access to more models, but because they need control.
Routing.
Governance.
Observability.
Budget enforcement.
Fallback strategies.
Provider independence.
The more AI becomes infrastructure, the more these concerns start looking less like AI problems and more like classic systems engineering problems.

Collapse
 
nfrankel profile image
Nicolas Fränkel

Pretty much.

Collapse
 
tokenmixai profile image
tokenmixai

Solid walkthrough — especially the Bifrost fallback config section.
The "self-hosted Go binary with built-in observability" reasoning is
exactly why teams pick it over LiteLLM-as-library.
One framing worth adding: the hosted-gateway space is now three lanes,
not two.

  1. Self-hosted (Bifrost, LiteLLM-as-library): full control, $0 gateway cost, your engineering time
  2. Hosted with markup (OpenRouter): zero ops, ~5.5% on top of provider rates
  3. Hosted at pass-through rates: TokenMix sits here — same OpenAI- compatible endpoint pattern, ~300 models (Claude, OpenAI, Gemini, DeepSeek, Qwen, Kimi), but provider pricing without markup Most "AI gateway" posts default to comparing lane 1 vs lane 2. Lane 3 is the missing option for teams that don't want to run infra but also don't want a routing tax on every token. One thing worth adding to your governance section: cost attribution across providers is the actual reason most teams adopt a gateway — not the routing or fallback features. Single bill across all providers is the unsexy but real adoption driver. (Disclosure: I work on the TokenMix research side.)
Collapse
 
0xdevc profile image
NOVAInetwork

The Mistral reasoning_effort enum mismatch is the part of this story I would dig into more. Gateway abstractions promise unified APIs but the actual provider schemas drift constantly, especially around newer fields like reasoning controls.

Your CLAUDE_CODE_DISABLE_THINKING=1 workaround gets you working but at the cost of the exact capability you wanted from a stronger model.

The deeper question is whose job it is to normalize. Three options I have seen attempted: the gateway translates per-provider (Bifrost would have to ship a Mistral-specific adapter for the enum), the client adapts to a lowest-common-denominator schema (which is what your env var workaround amounts to), or the provider conforms upstream (Mistral could accept "medium" and silently downcast). All three are bad in different ways and the right answer probably depends on whether the gateway is supposed to be transparent or opinionated.

The Patriot Act framing at the top of your post is also doing more work than people might notice. Provider independence is usually pitched as cost or reliability. Data sovereignty is a different axis and I think it is going to be the actual driver for European enterprises in 2026, not the cost optimization that gets the marketing attention.

Collapse
 
alexshev profile image
Alex Shev

AI gateways make more sense once teams move past experiments. Routing is useful, but the bigger value is policy, observability, cost control, and having one place to understand what the app is asking models to do.

Collapse
 
crenshinibon profile image
Dirk Porsche

Ditch Claude Code and use OpenCode ... it's even better and you can skip all that setup that will probably not yield good results anyway. Gemini strongly advises against this, because Claude Code and the Anthropic models are deeply entwined ... for a purpose ... locking-you-in ... so liberate yourself.

Collapse
 
jasmine_park_dev profile image
Jasmine Park

Good overview. The gateway benefit that gets underrated in these discussions is cost attribution: it is the one place every model call from every framework actually passes through, so it is the natural place to stamp team, feature, and request identifiers before the provider sees the call. We tried doing that attribution at the application layer first and every framework needed its own instrumentation; at the gateway it was one change. Routing and caching get the headlines but the accounting alone justified ours

Some comments may only be visible to logged-in visitors. Sign in to view all comments.