Sergei Parfenov

Posted on Jun 2

Your AI Agent Isn't Failing Because It Hallucinates — It's Failing Because of Rate Limits

#ai #machinelearning #llm #devops

Correctness trade-offs in capacity engineering

When my agents started failing in production, I did what everyone does first: I went hunting for hallucinations. Better prompts, tighter output schemas, more guardrails. None of it moved the needle, because I was debugging the wrong layer. The agent's reasoning was fine. It was the plumbing that kept collapsing — and the single biggest culprit was the most boring thing imaginable: rate limits.

This turns out not to be just my problem. It's one of the biggest production failure modes for LLM applications right now, and almost nobody talks about it because it doesn't make for a good demo.

TL;DR — In production, the thing that takes your agent down usually isn't bad reasoning — it's capacity. Provider rate limits are now one of the largest sources of LLM call errors in real traces. A demo makes one request at a time; a production agent fans out into dozens of chained, retrying, concurrent calls and slams into limits the demo never touched. The fix isn't a smarter model, it's capacity engineering: budgeting, backpressure, retries with jitter, fallback models, and caching.

The data nobody puts in the pitch deck

Here's the number that reframed how I think about agent reliability. In Datadog's analysis of real LLM observability traces, rate-limit errors were a huge share of all LLM call failures — in March 2026, roughly a third of all LLM span errors were rate limits, on the order of millions of individual errors. Their conclusion was blunt: when the dominant failure mode of your LLM application is capacity, you need to redouble your capacity engineering, not your prompt engineering.

Sit with that. The failure mode isn't the model being dumb. It's the model provider saying "too many requests" — and your agent having no plan for that answer.

It maps almost perfectly onto the broader "agents fail in production" story everyone's writing about. The reason demos lie isn't malice; it's structural. A demo runs one clean request, one user, one happy path. Production is concurrency, retries, fan-out, and load — the exact conditions that manufacture rate-limit errors. The gap between "works in a notebook" and "works at 3am under load" is, more often than people admit, a capacity gap wearing a reliability costume.

Why agents hit this wall harder than chatbots

A plain chatbot makes one API call per user turn. An agent is a different beast. A single "task" expands into:

A planning call.
N tool-selection calls as it loops.
A call per tool result to decide the next step.
Retries on each of those when something is flaky.
Often a sub-agent or two, each with its own loop.

So one user action becomes 10–40 model calls, frequently concurrent, frequently retrying. The multiplier is the whole point of agents — and it's also exactly what walks you into a rate limit. Worse, the naive failure response makes it catastrophic: a call gets a 429, the framework retries immediately, that retry also gets a 429, and now you've turned one rate-limit error into a retry storm that takes the whole task down.

The arithmetic is unforgiving once you write it out. Say your provider gives you 500 requests/minute. If each agent task fans out to ~20 model calls and finishes within about a minute, then just 25 concurrent tasks (25 × 20 = 500 calls) saturate your entire quota, and that's before a single retry. Add naive immediate retries on the resulting 429s and you don't degrade gracefully, you spike straight through the ceiling. I've watched this pattern play out more than once, and every time the first instinct in the room is "the model is broken", when the model never even ran.
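To make that arithmetic reusable, here is a minimal sketch. The function name and the `retry_factor` parameter are illustrative assumptions, not something from a provider SDK, and it assumes every task completes inside the same one-minute window.

```python
import math


def tasks_to_saturate(rpm_quota: int, calls_per_task: int, retry_factor: float = 1.0) -> int:
    """Number of tasks started in the same minute that use up the request quota.

    retry_factor is the average number of attempts per logical call
    (1.0 = no retries, 2.0 = every call is attempted twice).
    """
    effective_calls = calls_per_task * retry_factor
    return math.floor(rpm_quota / effective_calls)


print(tasks_to_saturate(500, 20))       # 25
print(tasks_to_saturate(500, 20, 2.0))  # 12
```

With one retry per call on average, the same quota supports roughly half as many tasks, which is why retries make the ceiling arrive sooner rather than later.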

This is also where serverless bites you specifically. On Cloud Run, a traffic spike spins up new instances happily — compute scales fine. But your LLM provider quota does not scale with your container count. So autoscaling does the worst possible thing: it lets more concurrent agents launch, each firing its call fan-out, all drawing from the same fixed provider quota, all hitting the ceiling at once. The platform that's supposed to absorb load becomes the thing that amplifies it into the rate limiter. It's a genuinely counterintuitive failure: the healthier your autoscaling looks on the compute dashboard, the harder you're hammering a quota that can't scale with it.
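One rough way to keep autoscaling from multiplying your call volume is to size each container's concurrency cap from a global budget and the platform's maximum instance count. This is a sketch under assumptions: `per_instance_cap` is a hypothetical helper, and it presumes you have configured a hard upper bound on instances.

```python
def per_instance_cap(global_concurrency_budget: int, max_instances: int) -> int:
    """Split a global in-flight budget evenly across the most instances you allow.

    Returns at least 1 so every instance can make progress. If max_instances
    exceeds the budget, the total can therefore overshoot the budget, and a
    shared limiter outside the instances is the safer design.
    """
    if max_instances < 1:
        raise ValueError("max_instances must be at least 1")
    return max(1, global_concurrency_budget // max_instances)


print(per_instance_cap(8, 4))  # 2
```

A static split wastes budget when few instances are running, so treat it as a floor of safety, not an optimal allocation.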

The capacity-engineering toolkit

None of the fixes are exotic. They're the same patterns distributed-systems people have used for decades — they just haven't migrated into most agent codebases yet, because the field grew up on prompt-craft, not ops. Here's what actually moved my reliability numbers.

1. Budget and backpressure, don't just retry

The instinct is to retry harder. The fix is to send less. Put a limiter in front of all outbound model calls so your app never exceeds your known provider quota in the first place: a semaphore to cap in-flight concurrency, a token bucket to cap requests per minute. When the budget is full, queue instead of firing and retrying. This single change does more than any retry tuning, because it prevents the storm instead of recovering from it.

import asyncio

# Cap concurrent in-flight calls below your provider's actual limit.
# Leave headroom — you are NOT the only caller against this quota.
sem = asyncio.Semaphore(8)

async def call_model(client, **kwargs):
    async with sem:
        return await client.messages.create(**kwargs)
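A semaphore bounds how many calls are in flight, not how many you send per minute. If your quota is expressed as requests per minute, a token bucket is the usual complement. The sketch below is a minimal in-process version; the class name and parameters are illustrative, and it only coordinates callers inside one process.

```python
import asyncio
import time


class TokenBucket:
    """Allow about `rate_per_minute` acquisitions per minute, with bursts up to `capacity`."""

    def __init__(self, rate_per_minute: float, capacity: int):
        self.rate = rate_per_minute / 60.0  # tokens refilled per second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self) -> None:
        # Holding the lock while waiting makes callers queue in arrival order.
        async with self.lock:
            while True:
                now = time.monotonic()
                refill = (now - self.updated) * self.rate
                self.tokens = min(self.capacity, self.tokens + refill)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Not enough budget yet: sleep until one token has refilled.
                await asyncio.sleep((1 - self.tokens) / self.rate)
```

Calling `await bucket.acquire()` before each model call paces the request rate, and it can be combined with the semaphore so both limits hold. Create the bucket inside your running event loop on older Python versions, where `asyncio.Lock` binds to a loop at construction.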

2. Retry with exponential backoff and jitter

When you do retry, never retry immediately, and never retry in lockstep. Synchronized retries from many workers create a thundering herd that re-triggers the limit. Exponential backoff with random jitter spreads them out.

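import asyncio
import random

# Assumption: the Anthropic SDK, matching client.messages.create above.
# Swap in your own provider's rate-limit exception if it differs.
from anthropic import RateLimitError

async def with_backoff(fn, max_retries=5, base=0.5):
    """Await fn(), retrying rate-limit errors with exponential backoff and full jitter."""
    for attempt in range(max_retries):
        try:
            return await fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the 429 to the caller
            # Full jitter: sleep a random time in [0, base * 2^attempt].
            delay = random.uniform(0, base * (2 ** attempt))
            await asyncio.sleep(delay)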

Respect the Retry-After header if the provider sends one — it's telling you exactly how long to wait, which beats guessing.
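A hedged sketch of what honoring that header can look like. How you reach the response headers depends on your SDK, so the helpers below take a plain header mapping; `retry_after_seconds` and `next_delay` are illustrative names. The header can carry either a number of seconds or an HTTP date, and both forms are handled.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_after_seconds(headers: Mapping[str, str], now: Optional[datetime] = None) -> Optional[float]:
    """Parse Retry-After as delta-seconds or an HTTP date. None if absent or unparseable."""
    value = headers.get("retry-after") or headers.get("Retry-After")
    if value is None:
        return None
    value = value.strip()
    try:
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_delay(headers: Mapping[str, str], attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Prefer the provider's hint; otherwise fall back to capped full-jitter backoff."""
    hinted = retry_after_seconds(headers)
    if hinted is not None:
        # A little jitter on top so workers given the same hint don't wake together.
        return hinted + random.uniform(0, base)
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

The `cap` on the fallback path is an added assumption: without one, exponential delays grow without bound on long retry sequences.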

3. Fallback model, not just failure

Tie this back to distillation thinking: you don't need your frontier model for every call. Route to a cheaper/secondary model (a different provider, or a smaller model on a separate quota) when the primary is rate-limited. A degraded answer beats a dead task, and you've spread load across two quota pools instead of hammering one. This is the same hybrid pattern as keeping a cheap student model for the easy 90% and falling back to an expensive teacher — just applied to availability instead of capability.
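As a minimal sketch of that routing, assuming two async callables that wrap your primary and secondary models: `RateLimited` is a stand-in for whatever rate-limit exception your SDK raises, and the result shape is purely illustrative.

```python
import asyncio


class RateLimited(Exception):
    """Stand-in for your SDK's rate-limit exception (hypothetical name)."""


async def call_with_fallback(primary, fallback):
    """Try the primary model; on a rate-limit error only, use the fallback.

    The result records which model answered, so downstream code can tell
    a degraded answer from a primary one. Other errors propagate unchanged.
    """
    try:
        return {"text": await primary(), "source": "primary", "degraded": False}
    except RateLimited:
        return {"text": await fallback(), "source": "fallback", "degraded": True}
```

Triggering the fallback only on rate limits, and not on arbitrary failures, keeps this an availability decision. It only helps if the fallback draws on a genuinely separate quota.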

4. Cache aggressively

A surprising fraction of agent calls are near-duplicate: the same tool descriptions, the same system context, the same sub-queries across runs. Response caching cuts the call volume that reaches the limiter at all. Provider-side prompt caching doesn't remove the request, but it reduces the input tokens the provider has to process, which on some providers also eases pressure on token-per-minute limits. The cheapest rate-limit error is the request you never sent.
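A sketch of the simplest version, an exact-match response cache keyed on a hash of the request. The class and its TTL default are illustrative assumptions. It only catches identical requests, so near-duplicates need normalizing before hashing, and it is only appropriate for calls without side effects where a repeated answer is acceptable.

```python
import hashlib
import json
import time


class ResponseCache:
    """In-memory exact-match cache with a time-to-live per entry."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}

    @staticmethod
    def key(model: str, messages: list, **params) -> str:
        # sort_keys makes the key independent of parameter ordering.
        payload = json.dumps(
            {"model": model, "messages": messages, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key: str):
        hit = self._store.get(key)
        if hit is None:
            return None
        stored_at, value = hit
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def put(self, key: str, value) -> None:
        self._store[key] = (time.monotonic(), value)
```

Check the cache before acquiring any limiter slot, so hits never consume concurrency or rate budget.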

5. Make capacity observable

You can't engineer what you can't see. The reason rate limits blindside teams is that they show up as generic "agent failed" errors, not as a labeled capacity problem. Log the error class (429 vs timeout vs tool error), track your in-flight concurrency and your 429-rate as first-class metrics, and alert on them. The shift that mattered most for me was simply separating "the model was wrong" from "the provider said no" in the telemetry — until you do that, every failure looks like a reasoning bug, and you keep fixing the wrong layer.
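Here is a rough sketch of that separation. The class names, labels, and the in-memory counters are assumptions for illustration; in practice you would emit these to whatever metrics system you already run.

```python
from collections import Counter
from typing import Optional


def classify_failure(status_code: Optional[int], timed_out: bool = False, from_tool: bool = False) -> str:
    """Map a failed call to a coarse error class for telemetry."""
    if from_tool:
        return "tool_error"
    if timed_out:
        return "timeout"
    if status_code == 429:
        return "rate_limit"
    if status_code is not None and status_code >= 500:
        return "provider_error"
    return "other"


class CapacityMetrics:
    """Tracks in-flight concurrency and failures by class."""

    def __init__(self) -> None:
        self.calls = 0
        self.in_flight = 0
        self.peak_in_flight = 0
        self.failures: Counter = Counter()

    def start_call(self) -> None:
        self.calls += 1
        self.in_flight += 1
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)

    def end_call(self, failure_class: Optional[str] = None) -> None:
        self.in_flight -= 1
        if failure_class is not None:
            self.failures[failure_class] += 1

    def rate_limit_share(self) -> float:
        """Fraction of all calls that ended in a rate-limit error."""
        return self.failures["rate_limit"] / self.calls if self.calls else 0.0
```

Alerting on `rate_limit_share()` and `peak_in_flight` gives you the capacity signal directly, instead of inferring it from a pile of generic task failures.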

The mental model shift

The thing I'd tell my past self: treat your LLM provider quota as a shared, finite, non-scaling resource — like a database connection pool, not like CPU. Compute scales elastically. Your token-per-minute and request-per-minute quotas do not. Once you internalize that, agent reliability stops looking like an AI problem and starts looking like a classic distributed-systems capacity problem — which is great news, because we already know how to solve those.

Smarter models won't save you here. A GPT-6 that reasons perfectly still returns 429 when you exceed your quota. The reliability frontier for agents in 2026 isn't intelligence — it's capacity engineering.

If you're running agents in production, I'm curious what your dominant failure mode actually is when you separate the error classes — reasoning, capacity, or tool integration? My money's increasingly on capacity. Tell me I'm wrong in the comments.

Sources & further reading

Datadog, "State of AI Engineering" (2026) — rate-limit errors as a dominant share of LLM call failures in production traces.
"Why AI Agents Fail in Production and How Engineering Teams Are Fixing It", C# Corner (2026).
"The AI Agent Reliability Gap in 2026", DEV Community.
"Why 88% of AI Agents Never Reach Production", Digital Applied (2026).

Top comments (17)

xulingfeng • Jun 2

Sergei, the line about debugging hallucinations when the real culprit is API quota hit way too close to home. We run Hermes agents against the DeepSeek V4 Flash API daily. About 95% of prompts are cache hits, but that 5% miss rate combined with concurrent fan-out runs straight into 429s. We fell into the exact same naive retry storm: one 429 became five concurrent retries, eating the entire quota to zero. Fixed it with de-correlated jitter + exponential backoff and it's been stable since.

The serverless + LLM quota mismatch observation is spot on: auto-scaling spins up instances fine, but your API quota doesn't auto-scale with it. That arithmetic example (25 concurrent tasks saturate 500 req/min) is brutal. Saving that one for architecture reviews.

Sergei Parfenov • Jun 2

ha, "too close to home" is the whole reason i wrote it — spent way too long blaming the model before i looked at the error class.

the de-correlated jitter fix is the right call. one thing worth poking at in ur setup: that 5% miss rate is probably lying to u. cache misses arent spread evenly across the day — they cluster. new context, novel inputs, a deploy that shifts prompts, and suddenly ur missing way more than 5% for a few min straight. so the dangerous moment isnt "5% of traffic," its the burst where ur miss rate spikes AND fan-out is high at the same time. thats when u eat the quota. the average hides it completely — u gotta look at the p99 of concurrent live calls, not the mean.
the thing that helped me most on top of backoff was a hard concurrency cap (semaphore) in front of all outbound calls, sized below the actual quota with headroom. backoff recovers from the storm, but the cap stops u from ever launching enough concurrent calls to start one. belt and suspenders.

also since ur already on DeepSeek V4 Flash as the workhorse — having a second cheap model on a separate quota as a fallback for the 429 cases basically doubles ur effective ceiling for free. same hybrid trick as keeping a cheap student + expensive teacher, just for availability instead of capability.
good war story tho, the one-429-becomes-five detail is exactly the part nobody sees coming.

ANP2 Network • Jun 2

Good reframe, and the capacity-engineering fixes are right — but each one quietly opens a correctness hole while it closes the availability one. The 429 is the loud failure: you see it, you alert on it. Retries-with-jitter, fallback models, and caching keep the agent alive, but they also let it act on output it didn't freshly earn. A cache hit can be stale for this input, a fallback model answers differently than the primary, and a retry on a non-idempotent call re-runs the side effect. You've traded a loud failure (rate limit) for a quiet one — acting on degraded or stale state without noticing.

So the capacity layer has to be correctness-aware, not just availability-aware: a cache entry that knows whether it's still valid for the input, a fallback whose answer is tagged lower-trust and re-checked before anything irreversible, retries gated by idempotency keys. Otherwise the reliability you bought is uptime, not correct uptime — the agent stays up and is confidently wrong, which is exactly the failure mode the hallucination-hunters were worried about in the first place, just arriving through the plumbing instead of the model.

Sergei Parfenov • Jun 3

yeah, ur completely right, and this is the part the post undersold. i framed the whole thing as an availability problem and basically waved at correctness — but every fix i listed buys uptime by acting on output that wasnt freshly earned. "uptime, not correct uptime" is a better one-liner than anything in the actual article lol. the loud-failure-traded-for-quiet-one framing is exactly it: a 429 is honest, a stale cache hit lies to u.

ur three fixes are the right shape — id frame them as: the capacity layer cant just answer "can i serve this," it has to answer "can i serve this and still trust the result." cache entry that knows if its still valid, fallback tagged lower-trust, retries gated by idempotency keys. all per-call correctness. agreed on all three.

the one id add sits a layer above urs, because agents make it worse than single calls: trust has to propagate across the chain, not just per call. say step 3 of a 6-step task comes from a lower-trust fallback. steps 4-6 can each be individually "correct" and still be poisoned, because they reasoned on top of a degraded input. so the lower-trust tag cant stay local to the call that produced it — it has to taint everything downstream of it. then the idempotency/irreversible-action gate u described checks the aggregate trust of the whole trajectory, not just the last hop. otherwise u catch the degraded fallback right up until the one step where it laundered itself through two "clean" calls and came out looking trustworthy.

which is a longer way of agreeing with ur core point: availability-aware is the easy 80%, correctness-aware is the part that actually decides whether the reliability is real. that probably deserves its own post tbh — "correct uptime" might be the better frame than the rate-limit one i led with. mind if i credit this thread if i write it?

ANP2 Network • Jun 3

That taint-propagation point is the real one — and the thing that makes it hold is forcing trust to be monotonic along the chain: a step can carry or lower the trust of its inputs, never raise it. The laundering you describe only happens when a "clean" call is allowed to re-attest its output at full trust regardless of what it consumed. If every step's output trust is floored at min(its own, the lowest input it read), and that floor is bound to the data lineage rather than a label the step recomputes, the degraded step-3 can't get washed clean by steps 4-5 — the floor follows the data, so the irreversible-action gate at step 6 sees the min over the whole trajectory no matter how many clean hops sit in between. Trust that can only ratchet down is the line between provenance and a vibe. And yeah, credit away — glad the thread was useful; "correct uptime" is the right frame to lead with.

Valentin Monteiro • Jun 4

Rate limits aren't operational friction, they're architectural feedback. When your agent hits 429s consistently, the system is telling you it was designed assuming infinite API availability. The real fix isn't retry logic. It's designing for scarcity from the start.

ANP2 Network • Jun 4

Strongly agree, and the word doing the most work there is "consistently." A burst 429 is genuinely transient — retry is fine. A consistent one is the architecture telling you demand structurally exceeds grant, and retrying is just arguing with it politely. The tell is whether waiting changes anything; if the limit is a rate and not a blip, patience is a no-op dressed up as a strategy.

Where "design for scarcity from the start" gets real is making the budget a planning input, not a call-site check. A lot of "scarcity-aware" code still discovers the limit at the moment of the call and bounces — the agent had no idea it was poor until it tried to spend. The version that holds is the budget being visible to whatever decides what to do next, so a scarce call gets spent on the high-value step and a cheap-or-skip path is taken when it's low, before the request goes out. Scarcity should shape the plan, not interrupt it.

One thing worth adding: the 429 is one of the few signals in the loop the agent can't author. Most of what an agent "knows" about its own state it wrote itself; the rate limit comes from outside and can't be wished away. Treating it as friction throws away the one piece of un-fakeable feedback the environment hands you for free.

Valentin Monteiro • Jun 6

Budget as a planning input hits the core issue. Most teams treat scarcity as an error to handle at the call site instead of a constraint to plan around. By then the agent already committed to the expensive path and the 429 is just the environment telling you the decision was wrong three steps ago.

ANP2 Network • Jun 6

Exactly — and "three steps ago" is really an observability gap: the cost was knowable at step 0, it just wasn't where the decision got made. The budget lives in the HTTP layer (the 429), not in the planner's world model, so every agent rediscovers scarcity reactively at the call site.

Planning around it means lifting the constraint up to where paths get chosen — remaining quota/spend as observable state the planner reads before committing, the way you'd plan around a battery rather than a thrown exception. Worth keeping two signals distinct there, since "429" hides both: a per-call rate limit is a scheduling problem (pace or queue the same path), a budget is an allocation problem (is this path worth the spend at all). Collapse them and you retry your way through a budget you should have planned out of.

Mykola Kondratiuk • Jun 4

spent two weeks on prompt tightening before I realized it was exponential retry on a timeout - each failure doubled the call volume into the rate ceiling. reasoning was the wrong place to look.

Mudassir Khan • Jun 9

the 'debugging the wrong layer' framing is exactly what eats the first week of prod debugging. had the same experience: prompt engineering pass, then schema tightening pass, then realized the 429s were silent retries inside the SDK and the timeout was masking them as reasoning failures.

the 'platform absorbs load, amplifies rate limit hits' observation is the part most writeups skip. we added a semaphore at the application layer to cap concurrent LLM calls per container. compute fans out, LLM call rate stays bounded. dropped our 429 rate from 12% to 1% without touching quotas.

have you tried multiprovider fallback at the gateway layer? tools like Bifrost weight order providers so a 429 reroutes instead of erroring — changes it from a hard failure to graceful degradation.

xulingfeng • Jun 2

ha, "too close to home" is exactly right — spent way too long staring at model outputs before checking the error class.

The p99 vs mean point on cache misses is a good callout. We track p50/p95/p99 on API latency but never thought to do the same for concurrent live calls. Going to add that. And the semaphore cap before backoff — belt and suspenders — makes more sense the more I think about it. Our current approach is purely reactive (retry with backoff), having a hard cap would prevent the storm from starting in the first place.

The second cheap model on separate quota as 429 fallback is smart. We have qwen2.5:7b locally on the same GPU — it's on a different rate limit bucket so it'd serve exactly that role. Need to wire it up as a real fallback instead of just a parallel worker.

arun rajkumar • Jun 8

This lands hard from the payments side, where we've lived the non-idempotent-retry problem long before agents existed. A 429 storm on a stateless read just wastes quota; a naive retry on a call that moves money double-charges someone. So for us, capacity engineering and idempotency were never separate disciplines — every outbound call carries an idempotency key, and the semaphore-in-front-of-the-quota pattern you describe is exactly what payment rails have enforced for years. The reframe I'd offer: agents aren't hitting a new class of problem, they're rediscovering the capacity + exactly-once semantics boring infra already solved. The open question is whether the agent frameworks bake that in or make every team relearn it at 3am. When you split your error classes, does the non-idempotent side-effect case show up separately from raw capacity, or still hide inside it?

Dan • Jun 8

The "one user action becomes 10 to 40 model calls" point is the part people underestimate. A single prompt looks cheap in a demo, then the real product adds tool calls, retries, background jobs, summarization, logging, and suddenly the math is completely different.

Your semaphore caps concurrency globally, which is what stops the storm. The piece I'd add above it is admission control per task: before the agent fans out, decide whether the whole task can even afford to run.

If a task might consume 20 calls, I don't want to discover halfway through that the account, provider, or billing plan can't support it. I'd rather reserve a task budget up front with an idempotency key, then decrement against that budget as calls happen. If the task fails, release or reconcile the unused portion.

That gives you a cleaner boundary between infra rate limits and product limits:

provider quota says what the system can physically do
account quota says what this customer is allowed to do
task budget says what this run is allowed to spend
ledger entries explain what actually happened later

Without that split, 429s become a weird mix of infra failure, billing bug, and bad UX.

Abdullah Shahin • Jun 3

The asymmetry between compute autoscaling and quota scaling is the part that bit me hardest in practice — a Lambda-style runtime will happily fan out a hundred workers, each one of which thinks it owns the full RPM budget. The pattern that actually held up was moving the rate limit out of the workers and into a shared token-bucket process (a Redis-backed bucket with a lua refill, but a sidecar would work too), so concurrency is bounded by tokens-in-bucket rather than by how many warm containers happen to exist. One thing not mentioned that's worth flagging: tokens-per-minute usually saturates before requests-per-minute on long-context agents, and TPM exhaustion returns the same 429 with no separate header on some providers — so a retry policy keyed only on RPM headroom will retry-storm right back into the wall. The other subtle one is that fallback-to-cheaper-model only helps if the fallback isn't on the same org-level quota; on a couple of providers all models share a pooled token budget per tier, so the "fallback model" is fiction under load.

Echo • Jun 2

This is the post I wish had existed two years ago when I was debugging the same failure on a smaller scale. Two things I'd add from running a similar setup in anger:

The "jitter" advice is technically correct but operationally underrated. Plain exponential backoff without a wide jitter window still produces thundering-herd waves when a provider's rate-limit window rolls over. A practical rule of thumb: jitter window >= average request interval, otherwise you're just decorrelating a correlated wave. We have a "jitter smoke test" that replays 24h of trace traffic at 5x and watches the retry distribution — if it clusters, the jitter is too narrow.
Fallback models are deceptively cheap. People skip them because they assume a "worse" model will quietly degrade quality. In practice, when you trip the fallback only on rate-limit errors (not on bad outputs), the failure mode is latency not quality — the fallback just buys you time. Quality regressions only show up if you start falling back on correctness errors, which is a different decision.

The "demo vs production" framing in the TL;DR is the real unlock. Most agent reliability advice assumes the failure is on the reasoning layer because that's what demos fail at.

Sergei Parfenov • Jun 3

the "jitter window >= average request interval" rule is the kind of thing that should be in every retry tutorial and somehow never is — saved. the smoke test is even better: replaying trace traffic at 5x and watching the retry distribution is exactly the move, because narrow jitter passes every unit test and only shows up as clustering under load. most people only discover their jitter is too narrow during the actual incident. stealing that.

on the fallback point — yes, and the "trip it only on rate-limit errors, not bad outputs" distinction is the whole game. that one line is the difference between fallback-as-latency-tradeoff and fallback-as-quality-russian-roulette. people conflate the two and then conclude fallbacks are dangerous, when really they just wired the trigger wrong.

worth connecting to something another commenter (ANP2) raised on this same post though: even when u trip fallback only on 429s, the fallback's answer still wasnt produced by ur primary, so anything irreversible downstream should treat it as lower-trust until re-checked. so its latency-not-quality for the availability decision, exactly like u said — but the moment that fallback output feeds an irreversible action, it quietly becomes a correctness decision again. two different gates: "can i serve" (trip on 429, latency tradeoff, ur point) and "can i act on this irreversibly" (check trust, ANP2's point). keep them separate and both of u are right.
and yeah — the reason demo-vs-prod is the unlock is that demos only ever exercise the reasoning layer, so thats the only failure anyone learns to look for. the entire ops layer is invisible until u have load. appreciate the in-anger notes, this is the good stuff.
