Ashok Kanjarla

Building a Production-Ready AI Gateway in ASP.NET Core

Introduction

AI integration in backend systems looks deceptively simple at first glance. You inject an OpenAI client, call GenerateAsync, and return the result. In development, everything behaves predictably. Under light load, the system performs well. The abstraction feels clean.

As usage increased within our ASP.NET Core services, AI quickly evolved from a helpful feature into a systemic risk. It became a latency bottleneck during peak traffic, a cost multiplier as token usage scaled, a failure amplifier whenever the provider experienced issues, and a debugging challenge when unexpected behavior surfaced in production.

We were handling thousands of requests per hour, many of them in flight concurrently. External LLM providers introduced unpredictable latency spikes, occasional rate limits, and intermittent outages. Each of these external behaviors directly affected our API performance because we had tightly coupled AI calls to our request pipeline.

Calling providers directly from controllers created architectural fragility. What looked convenient initially was in fact exposing our core system to external volatility.

We redesigned the system around one core principle:

AI is not a feature. It is a distributed dependency.


The Anti-Pattern: Direct Controller Calls

Our initial implementation followed a common pattern:

[HttpPost("generate")]
public async Task<IActionResult> Generate([FromBody] PromptRequest request)
{
    var result = await _openAiService.GenerateAsync(request.Prompt);
    return Ok(result);
}

From a surface-level perspective, this approach appears clean and efficient. However, at scale, this design introduces multiple systemic weaknesses.

There were no retry policies to handle transient network failures. There was no circuit breaker to protect the system during sustained provider outages. The implementation lacked provider abstraction, meaning vendor changes would ripple across the codebase. There was no rate limiting to protect upstream dependencies. Token cost tracking was absent, making budgeting reactive rather than proactive. Observability was fragmented, making incident investigation difficult.

If OpenAI throttled requests, our API would fail immediately. If latency spiked, request threads remained blocked longer than expected. If pricing models changed, we had no structured mechanism to monitor impact.

This design violated fundamental distributed systems resiliency principles.


Architectural Goal

Before rewriting anything, we defined explicit engineering requirements.

We needed the AI provider to be abstracted so that business logic remained vendor-agnostic. Failures from external providers could not cascade into the API layer. Token usage needed to be measurable to support cost governance and forecasting. Providers had to be swappable to support future multi-provider strategies. Latency required containment mechanisms. Observability needed to be built into the architecture rather than added later.

In short, AI integration had to be treated as infrastructure, not as a helper library.


High-Level Architecture

Client Request
      │
      ▼
ASP.NET Core API
      │
      ▼
AI Gateway Layer
 ├── Provider Abstraction
 ├── Retry Policy
 ├── Circuit Breaker
 ├── Rate Limiter
 ├── Token Metering
 ├── Structured Logging
 └── Fallback Strategy
      │
      ▼
AI Provider(s)

The key shift was introducing a dedicated AI Gateway layer. This layer became the containment boundary between core business logic and external AI providers. It absorbed volatility, enforced governance, and centralized resilience patterns.


Step 1 — Provider Abstraction

We began by defining a strict contract for AI providers:

public interface IAIProvider
{
    Task<AIResponse> GenerateAsync(AIRequest request, CancellationToken ct);
}
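
For context, the request and response contracts can stay deliberately small. A minimal sketch of what they might look like (the property names here are illustrative assumptions, not necessarily the exact shapes used in production):

// Illustrative contracts: real ones can carry model selection, temperature,
// provider-reported usage, and anything else the gateway needs.
public record AIRequest(string Prompt, string? Model = null, int? MaxTokens = null);

public record AIResponse(string Content, int PromptTokens, int CompletionTokens)
{
    public int TotalTokens => PromptTokens + CompletionTokens;
}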

A concrete implementation looked like this:

public class OpenAIProvider : IAIProvider
{
    private readonly HttpClient _client;

    public OpenAIProvider(HttpClient client)
    {
        _client = client;
    }

    public async Task<AIResponse> GenerateAsync(AIRequest request, CancellationToken ct)
    {
        var response = await _client.PostAsJsonAsync(
            "/v1/chat/completions",
            request,
            ct);

        response.EnsureSuccessStatusCode();

        return await response.Content.ReadFromJsonAsync<AIResponse>(cancellationToken: ct);
    }
}

This abstraction decouples business logic from the vendor, enables a multi-provider strategy, improves unit testing, and prevents long-term vendor lock-in. More importantly, it creates a seam in the architecture where resilience policies can be applied consistently.
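
The unit-testing benefit, in particular, is easy to see: anything that depends on IAIProvider can be exercised against a test double instead of a live endpoint. A minimal sketch (FakeAIProvider is illustrative, not part of the production code):

// Hypothetical test double: returns a canned response so that services depending on
// IAIProvider can be unit tested without network access or API keys.
public class FakeAIProvider : IAIProvider
{
    public Task<AIResponse> GenerateAsync(AIRequest request, CancellationToken ct) =>
        Task.FromResult(new AIResponse("stubbed completion", 0, 0));
}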


Step 2 — Resilience with Polly

External AI providers are network dependencies. Network dependencies fail. Therefore, resilience policies must be applied systematically.

We introduced exponential backoff retry policies:

// Requires the Microsoft.Extensions.Http.Polly package (HttpPolicyExtensions lives in Polly.Extensions.Http).
builder.Services.AddHttpClient<IAIProvider, OpenAIProvider>()
    .AddPolicyHandler(
        HttpPolicyExtensions
            .HandleTransientHttpError()              // HttpRequestException, 5xx, 408
            .OrResult(r => (int)r.StatusCode == 429) // also retry when the provider rate-limits us
            .WaitAndRetryAsync(3,
                retry => TimeSpan.FromSeconds(Math.Pow(2, retry))));

We also implemented circuit breakers:

// Chained onto the same AddHttpClient registration as the retry policy above.
.AddPolicyHandler(
    HttpPolicyExtensions
        .HandleTransientHttpError()
        .CircuitBreakerAsync(
            handledEventsAllowedBeforeBreaking: 5,
            durationOfBreak: TimeSpan.FromSeconds(30)));

Retries handle transient failures. Circuit breakers prevent cascading system degradation during sustained outages. Without a breaker, retry mechanisms can unintentionally amplify failure by increasing pressure on an already struggling provider.

Together, they create controlled resilience.
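
Retries and breakers do not, on their own, address the latency-containment goal from earlier. A per-attempt timeout can be chained onto the same registration; the ten-second budget below is an illustrative value rather than a tuned recommendation:

// Chained after the retry and breaker policies, which makes it the innermost policy:
// it caps how long any single attempt may run, and the retry policy can still recover from it.
.AddPolicyHandler(
    Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(10)));

With AddPolicyHandler, the first policy registered is the outermost, so ordering is part of the design: retry wraps the breaker, and both wrap the per-attempt timeout.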


Step 3 — Token Cost Governance

AI usage scales cost non-linearly. Without visibility, cost becomes reactive and unpredictable.

We implemented middleware to track token usage centrally:

public class TokenTrackingMiddleware
{
    private readonly RequestDelegate _next;
    private readonly ILogger<TokenTrackingMiddleware> _logger;

    public TokenTrackingMiddleware(RequestDelegate next, ILogger<TokenTrackingMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }

    public async Task Invoke(HttpContext context)
    {
        await _next(context);

        if (context.Items.TryGetValue("TokenUsage", out var tokens))
        {
            _logger.LogInformation("Token usage: {Tokens}", tokens);
        }
    }
}

This enabled cost forecasting, per-tenant billing capabilities, and anomaly detection. AI usage stopped being opaque and became measurable infrastructure.
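
One detail the middleware depends on: something upstream has to put the "TokenUsage" entry into HttpContext.Items in the first place. A small decorator around IAIProvider is one way to do it; the sketch below is illustrative (TokenRecordingProvider and TotalTokens are assumed names, and it relies on AddHttpContextAccessor() being called at startup):

public class TokenRecordingProvider : IAIProvider
{
    private readonly IAIProvider _inner;
    private readonly IHttpContextAccessor _httpContextAccessor;

    public TokenRecordingProvider(IAIProvider inner, IHttpContextAccessor httpContextAccessor)
    {
        _inner = inner;
        _httpContextAccessor = httpContextAccessor;
    }

    public async Task<AIResponse> GenerateAsync(AIRequest request, CancellationToken ct)
    {
        var response = await _inner.GenerateAsync(request, ct);

        // Surface the usage so TokenTrackingMiddleware can log it on the way out.
        var context = _httpContextAccessor.HttpContext;
        if (context is not null)
        {
            context.Items["TokenUsage"] = response.TotalTokens;
        }

        return response;
    }
}

The middleware itself is registered once with app.UseMiddleware<TokenTrackingMiddleware>() so that it observes every request flowing through the AI endpoints.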


Step 4 — Rate Limiting

To prevent AI overload and protect upstream providers, we introduced rate limiting:

builder.Services.AddRateLimiter(options =>
{
    options.AddFixedWindowLimiter("aiPolicy", limiterOptions =>
    {
        limiterOptions.PermitLimit = 100;
        limiterOptions.Window = TimeSpan.FromMinutes(1);
    });
});

Applied at the endpoint level:

[EnableRateLimiting("aiPolicy")]
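
One easy-to-miss step: the policy only takes effect once the rate limiting middleware is actually in the pipeline.

var app = builder.Build();

// Without this call, EnableRateLimiting attributes are silently ignored.
app.UseRateLimiter();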

This provided guardrails that preserved both system stability and provider relationships.


Step 5 — AI Gateway Orchestrator

The gateway orchestrates provider selection and failover logic:

public class AIGateway
{
    private readonly IEnumerable<IAIProvider> _providers;
    private readonly ILogger<AIGateway> _logger;

    public AIGateway(IEnumerable<IAIProvider> providers, ILogger<AIGateway> logger)
    {
        _providers = providers;
        _logger = logger;
    }

    public async Task<AIResponse> GenerateAsync(AIRequest request, CancellationToken ct)
    {
        foreach (var provider in _providers)
        {
            try
            {
                return await provider.GenerateAsync(request, ct);
            }
            // Let cancellation propagate; failing over on a cancelled request only wastes work.
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                _logger.LogWarning(ex, "Provider {Provider} failed. Trying next.", provider.GetType().Name);
            }
        }

        throw new Exception("All AI providers unavailable.");
    }
}

This design implements controlled failover and prepares the architecture for future multi-provider routing strategies.
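
Wiring the gateway up is ordinary dependency injection. A sketch of the composition root, where AnthropicProvider is a hypothetical second implementation used only to show the shape of a multi-provider registration:

// Each provider gets its own typed HttpClient; chain the retry, breaker,
// and timeout policies from Step 2 onto each registration.
builder.Services.AddHttpClient<OpenAIProvider>();
builder.Services.AddHttpClient<AnthropicProvider>();

// Register both behind the common interface. The gateway receives them as
// IEnumerable<IAIProvider> in registration order, which becomes the failover order.
builder.Services.AddScoped<IAIProvider>(sp => sp.GetRequiredService<OpenAIProvider>());
builder.Services.AddScoped<IAIProvider>(sp => sp.GetRequiredService<AnthropicProvider>());

builder.Services.AddScoped<AIGateway>();

With a single provider, the AddHttpClient<IAIProvider, OpenAIProvider>() form from Step 2 is simpler; the shape above only matters once a second provider enters the picture.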


What We Explicitly Rejected

We deliberately avoided patterns that appear convenient but degrade system integrity over time. We rejected direct SDK calls in controllers, blocking calls such as .Result, static singletons without resilience policies, hard-coded API keys, and ignoring cancellation tokens.

These approaches are acceptable in prototypes. They are not acceptable in production-grade distributed systems.


Operational Impact

Within three months of implementing the AI Gateway architecture, we observed measurable improvements. AI-related failures dropped by 42%. Token costs were reduced by 31% due to better visibility and governance. There were zero cascading API outages during provider incidents. Response latency stabilized within SLA thresholds.

Most importantly, AI became predictable infrastructure rather than a systemic risk.


Final Thoughts

External dependencies will fail. Latency will spike. Costs will fluctuate. These realities do not disappear with better SDKs or cleaner controller methods.

Architecture does not eliminate volatility. It contains it.

By introducing an AI gateway layer with abstraction, resilience policies, rate limiting, and observability, we transformed AI from an unstable integration into governed infrastructure. The difference is not merely technical elegance — it is operational confidence.

Treat AI as infrastructure, not as a feature, and your systems will scale with clarity rather than uncertainty.
