This post starts a weekly series where I'll be writing about practical things I've learned while working on real systems: the kind of problems that don't show up in tutorials but show up immediately in production.
The idea isn't to teach concepts from scratch. It's to document situations where something behaved unexpectedly, what we assumed at first, what actually went wrong, and what finally made the system stable. Each week will focus on one specific issue (backend behavior, distributed coordination, DevOps and infrastructure decisions, or AI), explained from the perspective of debugging and reasoning through it.
The Symptom
We had a model that worked perfectly fine most of the time. But at random, the system would go unstable:
- sudden throttling
- latency spikes
- retries increasing the load instead of relieving it
- then everything calming down again
The confusing part: our overall request volume was well within limits. So the model wasn't overloaded. Yet it behaved as if it were.
What Was Actually Happening
The problem wasn't how many requests we sent. It was when we sent them. Multiple independent AWS clients were calling the same model. Each one behaved correctly on its own, but occasionally they lined up at the same moment and hit the model together.
Think of it like this: the model was fine with steady traffic, but not with sudden synchronized bursts.
So instead of: 50 requests spread over time
we were unintentionally creating: 50 requests at the same second
And LLMs really don't like that.
Why Normal Rate Limiting Didn’t Help
Our first instinct was obvious: rate limit it. But typical rate limiting solves a different problem. It limits volume, not simultaneous execution. We could still be under the per-second quota and fail, because all the requests arrived together.

We tried approaches like:
- local locks
- counters
- smoothing through queues

They reduced the frequency of failures but didn't remove them, because the issue wasn't counting. It was coordination. We needed the system to agree on who gets to call the model right now.
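To make the distinction concrete, here is a minimal sketch of the kind of local, counter-based limiter we mean. It is illustrative, not the exact code we ran: it keeps one process under "N calls per second", but two processes each running their own copy can still spend their full quota in the same instant.

```python
import threading
import time


class PerSecondLimiter:
    """Allow at most `limit` calls per rolling one-second window, in one process only."""

    def __init__(self, limit: int):
        self.limit = limit
        self.calls = []                 # timestamps of recent calls
        self._lock = threading.Lock()

    def acquire(self):
        # Block until this process is under its own per-second quota.
        while True:
            with self._lock:
                now = time.monotonic()
                # Drop timestamps older than one second
                self.calls = [t for t in self.calls if now - t < 1.0]
                if len(self.calls) < self.limit:
                    self.calls.append(now)
                    return
            time.sleep(0.05)


# Each process stays under its own quota, yet nothing prevents several
# processes from firing their whole allowance at the same moment.
```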
The Shift in Thinking
Instead of treating the model like a normal API, we treated it like a shared critical resource.
Why ZooKeeper ❓
We needed something that could coordinate independent callers reliably. ZooKeeper gave us exactly one property we cared about: a lock that automatically disappears if the caller dies.
- No stale locks.
- No manual cleanup.
- No guessing ownership.
This matters a lot in distributed systems: a failure shouldn't leave the system permanently blocked.
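That property comes from ZooKeeper's ephemeral nodes: a node created by a session is deleted automatically when that session ends, so a crashed process can't keep holding anything. A tiny sketch of the primitive, using the kazoo client introduced below (the path and value here are illustrative):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper:2181")
zk.start()

# This node exists only while this client's session is alive;
# if the process dies, ZooKeeper removes the node automatically.
zk.create("/llm_model_lock_owner", b"worker-1", ephemeral=True)
```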
The Approach
Before any request could call the model:
Acquire distributed lock -> Call model -> Release lock
Conceptually: Many clients → one controlled entry → model
We didn't slow the system down. We removed chaos from it.
Using Kazoo (Python)
Create the client:

```python
from kazoo.client import KazooClient

# Connect to the ZooKeeper ensemble
zk = KazooClient(hosts="zookeeper:2181")
zk.start()
```
Create the lock:

```python
from kazoo.recipe.lock import Lock

# Every caller must use the same znode path to contend for the same lock
lock = Lock(zk, "/llm_model_lock")
```
Protect the model call:

```python
# Only one holder at a time; the lock is released automatically
# when the block exits, or when the holder's session dies
with lock:
    response = call_model(payload)
```
Now every caller competes for the same entry point. ZooKeeper handles ordering and release automatically.
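In practice you may not want a caller to wait indefinitely when the lock is busy. Here is a slightly fuller sketch relying on Kazoo's lock timeout behavior; the 30-second timeout, the worker identifier, and the call_model_with_lock wrapper are my own illustrative choices, and call_model stays a placeholder as above.

```python
from kazoo.client import KazooClient
from kazoo.exceptions import LockTimeout
from kazoo.recipe.lock import Lock

zk = KazooClient(hosts="zookeeper:2181")
zk.start()

# identifier is optional; it just makes the current holder easier to debug
lock = Lock(zk, "/llm_model_lock", identifier="worker-1")


def call_model_with_lock(payload):
    try:
        # Wait up to 30 seconds for our turn instead of queueing forever
        lock.acquire(timeout=30)
    except LockTimeout:
        # Surface contention explicitly rather than piling on retries
        raise RuntimeError("timed out waiting for the model lock")
    try:
        return call_model(payload)   # placeholder for the real model call
    finally:
        lock.release()
```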
What Changed After This
The interesting part wasn't speed. It was stability.
We observed:
- throttling almost disappeared
- retry storms stopped happening
- latency became predictable
- failures became rare instead of clustered

Nothing about the model changed. We just stopped letting everyone talk at once.
The Biggest Learning
I originally thought rate limiting was about controlling traffic volume. In distributed AI systems, it's usually about controlling concurrency. You don't prevent overload by sending fewer requests.
You prevent overload by controlling simultaneous execution.
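If the model can safely handle a few calls at once rather than strictly one, the same idea scales up: Kazoo also ships a ZooKeeper-backed semaphore that caps simultaneous holders cluster-wide. A minimal sketch, where the znode path and the max_leases value are assumptions:

```python
from kazoo.client import KazooClient
from kazoo.recipe.lock import Semaphore

zk = KazooClient(hosts="zookeeper:2181")
zk.start()

# At most 3 callers inside the model call at any moment, across all processes
model_slots = Semaphore(zk, "/llm_model_semaphore", max_leases=3)

with model_slots:
    response = call_model(payload)   # call_model / payload as in the post
```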
Retries fix symptoms. Coordination fixes causes.
LLM integrations often look like: send request → get response
But production behavior depends on what happens around that call. In this case, reliability didn't come from scaling infrastructure; it came from adding coordination in front of the model. Sometimes stability isn't about doing things faster. It's about letting them happen in order.
More posts coming weekly — each one focused on a single real problem and what it taught me.
