If you ask a junior engineer if the Kafka cluster is healthy, they will check if the PID is running and port 9092 is listening. If you ask a senior engineer, they will ask you about the ISR shrink rate and the 99th percentile produce latency.
Running Apache Kafka in a Docker container on your laptop is a lie. It tricks you into thinking Kafka is simple. In production, Kafka is a beast that rarely dies a loud, dramatic death. Instead, it suffers from "grey failures"—it stays "up," but it becomes slow, unreliable, or dangerous.
This post is about those grey failures. It’s about the difference between a cluster that is running and a cluster that is actually working.
The "Soft Failure" Modes
In production, you will rarely see a hard crash where a broker just exits. The JVM is robust. What you will see are soft failures that degrade your pipeline silently until data loss occurs or downstream consumers starve.
- The Rebalance Storm: This is the most common "silent killer" of throughput. If your consumer group is unstable—perhaps due to a heartbeat timeout or a long GC pause in the consumer application—the group coordinator triggers a rebalance.
During a rebalance, consumption stops. If you have a "thundering herd" scenario where consumers flap (connect/disconnect/connect), your cluster spends 100% of its time rebalancing and 0% of its time processing messages. The dashboard says "Green," but throughput is zero. (A consumer-tuning sketch follows this list.)
- ISR Shrink & Data Risk The "In-Sync Replica" (ISR) list is your safety net. If you have replication.factor=3, you expect 3 copies. But if network jitter causes two followers to fall behind, the leader shrinks the ISR to just itself (1). The cluster is still "up." You can still write to it (if min.insync.replicas=1, which is a terrible default). But you are now running a distributed system as a single point of failure. One disk failure on that leader, and the data is gone forever.
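To make that failure loud instead of silent, pair a topic-level min.insync.replicas=2 with producer acks=all. Here is a minimal sketch with the standard Java clients — the topic name, partition count, and numbers are placeholders, not recommendations:

```java
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Map;
import java.util.Properties;

public class DurabilitySettings {
    // Topic side: 3 replicas, and at least 2 of them must have the write
    // before it counts. "orders" and the partition count are examples.
    static NewTopic ordersTopic() {
        return new NewTopic("orders", 12, (short) 3)
                .configs(Map.of("min.insync.replicas", "2"));
    }

    // Producer side: acks=all makes the leader wait for the in-sync replicas,
    // so the min.insync.replicas=2 floor is actually enforced.
    static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // no duplicates on retry
        return props;
    }
}
```

Pass the NewTopic to Admin#createTopics. With this in place, an ISR shrink to one replica turns into NotEnoughReplicas errors on the producer instead of silent single-copy writes.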
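On the rebalance side, most of the flapping traces back to a handful of consumer settings. A sketch with illustrative values only — tune them against your real processing times and GC pauses rather than copying them blindly:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import java.util.Properties;

public class GroupStabilitySettings {
    static Properties consumerProps() {
        Properties props = new Properties();
        // How long the coordinator waits for heartbeats before evicting the member.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "45000");
        // Heartbeat interval, conventionally about a third of the session timeout.
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "15000");
        // Maximum gap between poll() calls before the consumer is kicked from the group.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");
        // Smaller batches keep each poll-to-poll cycle short and predictable.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "200");
        return props;
    }
}
```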
Architectural Foot-Guns
The Over-Partitioning Trap
"More partitions = more concurrency," right? Theoretically, yes. Operationally, no.
Each partition is a directory of log segments on disk and another unit of state for the Controller to manage. I’ve seen teams spin up 50 partitions for a topic with 10 messages a second "just in case."
The cost:
Controller Recovery: If a broker fails, the Controller must elect new leaders for thousands of partitions. This takes time. During that election window, those partitions are unavailable.
Open File Limits: Every log segment and its index are open file handles. Linux has limits. Kafka hits them.
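If you want to know how much of this state a cluster is already carrying, the AdminClient can tally it. A rough sketch — the bootstrap address is a placeholder:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class PartitionBudget {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder address
        try (Admin admin = Admin.create(props)) {
            Set<String> names = admin.listTopics().names().get();
            Map<String, TopicDescription> topics = admin.describeTopics(names).all().get();
            int partitions = topics.values().stream()
                    .mapToInt(t -> t.partitions().size())
                    .sum();
            int replicas = topics.values().stream()
                    .mapToInt(t -> t.partitions().stream()
                            .mapToInt(p -> p.replicas().size()).sum())
                    .sum();
            // Every replica is a directory of segment and index files on some broker,
            // and one more leadership decision for the Controller during a failover.
            System.out.printf("partitions=%d, total replicas=%d%n", partitions, replicas);
        }
    }
}
```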
The Wrong Threading Model
If you are writing a custom consumer in Java/Go/Python, do not perform heavy blocking processing (like DB writes or HTTP calls) in the poll() loop.
If your processing takes longer than max.poll.interval.ms, the broker assumes you are dead, kicks you out of the group, and triggers a rebalance (see above).
The Fix: Decouple polling from processing using internal queues or worker threads, but handle offset commits carefully to avoid "at-most-once" delivery on crashes.
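One shape that decoupling can take — a minimal sketch, assuming a hypothetical events topic with String messages — is to pause the assigned partitions, hand the batch to a worker thread, and keep calling poll() (which returns nothing while paused) so the group coordinator stays happy. Offsets are committed only after the batch has actually been processed:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PausingConsumer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "reporting-service");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit only after the work is done
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        ExecutorService worker = Executors.newSingleThreadExecutor();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;

                // Park the partitions and push the slow work off the poll thread.
                consumer.pause(consumer.assignment());
                Future<?> inFlight = worker.submit(() -> process(records));

                // Keep polling (returns nothing while paused) so we never exceed
                // max.poll.interval.ms, no matter how slow the batch is.
                while (!inFlight.isDone()) {
                    consumer.poll(Duration.ofMillis(500));
                }
                inFlight.get();          // surface any processing exception
                consumer.commitSync();   // only now are the offsets safe to commit
                consumer.resume(consumer.paused());
            }
        }
    }

    static void process(ConsumerRecords<String, String> records) {
        for (ConsumerRecord<String, String> record : records) {
            // The blocking DB write / HTTP call lives here, off the poll thread.
        }
    }
}
```

A production version also needs a ConsumerRebalanceListener, because a rebalance while the worker is busy will reassign (and un-pause) partitions under you; the sketch only shows the shape of the decoupling.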
Performance Ceilings: Where Kafka Actually Chokes
Kafka is rarely CPU bound (unless you use heavy compression like Zstd or SSL encryption). The bottlenecks usually lie elsewhere:
The Page Cache (RAM): Kafka relies heavily on the OS page cache. If your consumers are fast, they read from RAM (cache hits). If they fall behind (lag), they read from Disk (cache miss).
The Death Spiral: Lagging consumers force disk reads -> Disk I/O saturates -> Producers get blocked waiting for disk -> Everyone slows down.
Network Bandwidth: In AWS/Cloud, you have limits. If you saturate the NIC replicating data to followers, the leader can't accept new writes.
Garbage Collection (GC): A massive heap (32GB+) can lead to "Stop-the-World" GC pauses. If the pause > zookeeper.session.timeout.ms, the broker is marked dead by the cluster, triggering massive leader elections, even though the process is fine.
Observability: From Reactive to Proactive
Stop looking at "CPU Usage." It’s a vanity metric for Kafka. Here is the kind of dashboard you actually need to identify an unhealthy cluster before it becomes an outage.
Under Replicated Partitions (URP)
The Golden Signal. If this is > 0, your cluster is unhealthy: replicas are falling behind. If the number is small and stable, you can usually ride it out; if it is growing, you are on your way to losing data.
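URP is exposed per broker over JMX as kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions. A minimal polling sketch — the hostname and JMX port are placeholders, and it assumes the broker was started with JMX enabled:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class UrpCheck {
    public static void main(String[] args) throws Exception {
        JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi")); // placeholder host/port
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName urp = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
            Number value = (Number) mbs.getAttribute(urp, "Value");
            // Anything above zero on any broker deserves an alert.
            System.out.println("UnderReplicatedPartitions = " + value);
        } finally {
            connector.close();
        }
    }
}
```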
Request Queue Time
This measures how long a request waits in the broker's queue before being processed.
Low Queue / High Latency: The disk/network is slow.
High Queue / High Latency: The CPU is overloaded.
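Queue time is also a broker JMX metric, broken out per request type; the same connection pattern as the URP check works here, and the 99thPercentile attribute is the one worth alerting on. Host and port are placeholders again:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class QueueTimeCheck {
    public static void main(String[] args) throws Exception {
        JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi")); // placeholder host/port
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Queue time for produce requests; swap "Produce" for "FetchConsumer"
            // to see the consumer side.
            ObjectName queueTime = new ObjectName(
                    "kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce");
            Number p99 = (Number) mbs.getAttribute(queueTime, "99thPercentile");
            System.out.println("Produce queue time p99 (ms) = " + p99);
        } finally {
            connector.close();
        }
    }
}
```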
Consumer Lag: Time vs. Offsets
Monitoring "Offset Lag" (e.g., 10,000 messages behind) is deceptive. 10,000 messages might take 1 second to process or 1 hour.
Monitor "Consumer Lag in Seconds". This tells you the business impact: "Real-time reporting is actually 15 minutes delayed."Produce P99 Latency
Produce P99 Latency
Average latency lies. If your average is 2ms but your P99 is 500ms, your producers are experiencing backpressure. This usually indicates disk saturation or lock contention.
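The Java producer's client-side metrics give you request-latency-avg and request-latency-max, not percentiles, so for the tail you generally have to measure it yourself. A sketch using the send callback — the 500 ms threshold is arbitrary:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SlowSendDetector {
    // Wraps send() so any acknowledgement slower than the threshold gets logged.
    // This is the tail latency that a 2 ms average happily hides.
    static void sendTimed(KafkaProducer<String, String> producer,
                          ProducerRecord<String, String> record) {
        long start = System.nanoTime();
        producer.send(record, (metadata, exception) -> {
            long millis = (System.nanoTime() - start) / 1_000_000;
            if (exception != null || millis > 500) {
                System.err.printf("slow or failed send: %d ms (%s)%n", millis, exception);
            }
        });
    }
}
```

Feed those numbers into whatever histogram your metrics stack already has and alert on the P99, not the mean.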
Conclusion: Building for the Bad Day
Reliability in Kafka isn't about preventing failure; it's about surviving it.
Set min.insync.replicas to 2 (with RF=3) to enforce durability, even if it sacrifices availability.
Monitor ISR Churn, not just URP.
Alert on Consumer Group Rebalance Rate.
Kafka is a powerful engine, but don't confuse the engine running with the car moving. Check your dashboards, look for the grey failures, and respect the operational limits.


