Namratha

Week 1 — When LLM Failures Weren’t About Load, But Timing (ZooKeeper + Distributed Locking)

This post starts a weekly series where I’ll be writing about practical things I’ve learned while working on real systems: the kind of problems that don’t show up in tutorials but show up immediately in production.

The idea isn’t to teach concepts from scratch. It’s to document situations where something behaved unexpectedly: what we assumed at first, what actually went wrong, and what finally made the system stable. Each week will focus on one specific issue, whether backend behavior, distributed coordination, DevOps and infra decisions, or AI, explained from the perspective of debugging and reasoning through it.

The Symptom

We had a model that worked perfectly fine most of the time. But at random, the system would go unstable: sudden throttling, latency spikes, retries increasing the load instead of fixing it, and then everything calming down again.

The confusing part: our overall request volume was well within limits. So the model wasn’t overloaded. Yet it behaved like it was.

What Was Actually Happening

The problem wasn’t how many requests we sent. It was when we sent them. Multiple independent AWS clients were calling the same model. Each one behaved correctly on its own, but occasionally they lined up at the same moment and hit the model together.

Think of it like this: the model was fine with steady traffic, but not with sudden synchronized bursts.

So instead of: 50 requests spread over time,
we were unintentionally creating: 50 requests at the same second.
And LLMs really don’t like that.

Why Normal Rate Limiting Didn’t Help

Our first instinct was obvious: rate limit it. But typical rate limiting solves a different problem. It limits volume, not simultaneous execution. We could still be under the per-second quota and fail, because all requests arrived together. We tried approaches like local locks, counters, and smoothing through queues. They reduced the frequency of failures but didn’t remove them, because the issue wasn’t counting. It was coordination. We needed the system to agree on who gets to call the model right now.
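
To see the difference, here’s a minimal, hypothetical sketch of a per-second counter (the class and its limit are purely illustrative, not our actual limiter). It stays under the quota yet happily admits 50 requests that all arrive at the same instant, because it only counts volume.

import time

# Hypothetical per-second counter: it caps volume, but says nothing about
# whether the admitted requests hit the model at the same instant.
class PerSecondCounter:
    def __init__(self, limit_per_second):
        self.limit = limit_per_second
        self.window_start = time.time()
        self.count = 0

    def allow_request(self):
        now = time.time()
        if now - self.window_start >= 1.0:
            # New one-second window: reset the counter
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

limiter = PerSecondCounter(limit_per_second=60)
# 50 callers lining up at the same moment all get a "yes":
# still under the quota, still a synchronized burst at the model.
print(sum(limiter.allow_request() for _ in range(50)))  # prints 50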

The Shift in Thinking: instead of treating the model like a normal API, we treated it like a shared critical resource.

Why ZooKeeper ❓

We needed something that could coordinate independent callers reliably. ZooKeeper gave us exactly one property we cared about: a lock that automatically disappears if the caller dies.

  1. No stale locks.
  2. No manual cleanup.
  3. No guessing ownership.

This matters a lot in distributed systems — failures shouldn’t make the system permanently blocked.
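
Jumping ahead slightly to the Kazoo client shown below, here’s a minimal sketch of the primitive that property comes from: an ephemeral znode whose lifetime is tied to the client’s session. The path /demo_owner is purely illustrative.

from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper:2181")
zk.start()

# ephemeral=True ties the node's lifetime to this client's session
zk.create("/demo_owner", b"worker-1", ephemeral=True)
print(zk.exists("/demo_owner") is not None)  # True while the session is alive

# When the session ends (clean stop, crash, or losing the connection past the
# session timeout), ZooKeeper deletes the node itself, so no stale lock remains.
zk.stop()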

The Approach

Before any request could call the model:
Acquire distributed lock -> Call model -> Release lock

Conceptually: Many clients → one controlled entry → model

We didn’t slow the system down. We removed chaos from it.

Using Kazoo (Python)

Create the client:

from kazoo.client import KazooClient 
zk = KazooClient(hosts="zookeeper:2181") 
zk.start()

Create the lock:

from kazoo.recipe.lock import Lock 
lock = Lock(zk, "/llm_model_lock")

Protect the model call:

with lock:
    response = call_model(payload)

Now every caller competes for the same entry point. ZooKeeper handles ordering and release automatically.
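
In practice you usually don’t want callers blocking forever while they wait for their turn. Here’s a hedged sketch of how the pieces might fit together with a bounded wait; call_model, payload, and the 30-second timeout are illustrative placeholders, not our production values.

from kazoo.client import KazooClient
from kazoo.exceptions import LockTimeout

zk = KazooClient(hosts="zookeeper:2181")
zk.start()
lock = zk.Lock("/llm_model_lock")  # equivalent to Lock(zk, "/llm_model_lock")

def guarded_call(payload):
    try:
        # Wait at most 30 seconds for the lock instead of queueing indefinitely
        lock.acquire(timeout=30)
    except LockTimeout:
        # Fail fast (or back off) rather than letting callers pile up
        raise RuntimeError("could not acquire the model lock in time")
    try:
        return call_model(payload)
    finally:
        lock.release()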

What Changed After This

The interesting part wasn’t speed. It was stability. We observed:

  1. Throttling almost disappeared.
  2. Retry storms stopped happening.
  3. Latency became predictable.
  4. Failures became rare instead of clustered.

Nothing about the model changed. We just stopped letting everyone talk at once.

The Biggest Learning

I originally thought rate limiting was about controlling traffic volume. In distributed AI systems, it’s usually about controlling concurrency. You don’t prevent overload by sending fewer requests.
You prevent overload by controlling simultaneous execution.

Retries fix symptoms. Coordination fixes causes.

LLM integrations often look like: send request → get response

But production behavior depends on what happens around that call. In this case, reliability didn’t come from scaling infrastructure; it came from adding coordination in front of the model. Sometimes stability isn’t about doing things faster. It’s about letting them happen in order.

More posts coming weekly — each one focused on a single real problem and what it taught me.
