Background
At my current org, I got introduced to a very fun tool, PagerDuty - because who doesn't love being woken up at 3:31 AM?
A similar alert flared up for one of our merchant services. While it mercifully avoided the 3 AM slot, the diagnosis was clear: the containers were restarting. There was no bad deployment, just a burst of events on a Kafka topic and a trail of exit code 137.
This is the story of how we tracked down a memory issue in one of our core merchant services at PayPay: not exactly a leak, but a partition mismatch and a battle between garbage collectors.
The Incident: When Bursts Turn Into Crashes
It started with a Kafka topic.
This topic powers merchant payouts in Japan and, on average, processes around 3,500 messages per day. On paper, that volume sounds harmless. In reality, the traffic pattern was anything but smooth. Instead of being evenly distributed, most of those messages arrived in short, aggressive bursts.
And that’s when things began to fall apart.
Every time a burst hit, our merchant service pods would suddenly spike in memory usage. Within minutes, they would hit their memory limit and get terminated by Kubernetes with an OOMKill.
The Investigation
When we looked at the consumer topology, the first red flag was almost too obvious to ignore:
Kafka partitions: 6
Active consumer pods: 10
In Kafka, this setup has a very specific behavior. A single partition can be consumed by only one consumer within a consumer group. So with 6 partitions and 10 pods, 4 pods were effectively idle, waiting on the sidelines.
Every burst of messages was being funneled into just 6 pods, concentrating the entire load on 60% of our fleet, while the remaining pods contributed nothing. Under normal traffic, this imbalance went unnoticed. But during bursts, those active pods suddenly did significantly more work than they were sized for.
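The mismatch is easy to confirm with Kafka's admin API: list the members of the consumer group and count how many partitions each one actually owns. Here is a minimal sketch of that check (the bootstrap address, topic, and group id are placeholders, not our real configuration):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;
import org.apache.kafka.clients.admin.MemberDescription;

public class AssignmentCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            ConsumerGroupDescription group = admin
                    .describeConsumerGroups(List.of("merchant-payout-consumer")) // placeholder group id
                    .all().get()
                    .get("merchant-payout-consumer");

            // With 6 partitions and 10 members, 4 of these lines print "0 partitions".
            for (MemberDescription member : group.members()) {
                System.out.printf("%s -> %d partitions%n",
                        member.consumerId(), member.assignment().topicPartitions().size());
            }
        }
    }
}
```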
That explained where the pressure was landing, but not why it was causing hard crashes.
To answer that, we moved to the next step: capturing heap and thread dumps in our staging environment while simulating 5x production load. And that’s where things started getting really interesting.
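We captured the dumps with standard JDK tooling (`jcmd <pid> GC.heap_dump <file>` and `jcmd <pid> Thread.print` from the command line). The same data is also reachable programmatically through the platform MXBeans; a rough sketch, assuming a HotSpot JVM and an illustrative output path:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DiagnosticDumps {
    public static void main(String[] args) throws Exception {
        // Heap dump in HPROF format; "true" restricts the dump to live objects.
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        hotspot.dumpHeap("/tmp/merchant-service-heap.hprof", true);

        // Thread dump with lock information for every live thread.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info);
        }
    }
}
```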
The ZGC Struggle
We were initially running the service on ZGC.
On paper, ZGC is impressive - microsecond-level pause times and near-zero stop-the-world events. But what often gets missed is its biggest trade-off: ZGC needs a lot of breathing room. It relies heavily on spare heap headroom to keep allocation and reclamation in balance.
When we profiled the system, the numbers told a worrying story. ZGC’s old-generation worker thread alone had consumed over 584 seconds of CPU time. ZGC wasn’t idle - it was working relentlessly, trying to keep up with a flood of newly allocated objects.
But it simply couldn’t reclaim memory fast enough.
The heap kept expanding, allocation pressure kept rising, and eventually the pod crossed its memory limit and was OOMKilled. Low pause times didn’t matter if the process couldn’t stay alive long enough to benefit from them.
The Substitute: G1GC
At this point, we decided to switch strategies and flipped the JVM to G1GC, then reran the same load test.
The difference was immediate:
Memory usage: Stabilized under 1 GiB (previously it climbed all the way to 3 GiB)
Pause times: Increased slightly to around 35 ms, still comfortably within our SLAs
Stability: No crashes. No restarts.
G1GC traded a bit of pause time for predictable memory behavior, which turned out to be exactly what this workload needed. In a burst-heavy system, consistency beats theoretical best-case latency every single time.
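The switch itself is just a JVM flag change: dropping `-XX:+UseZGC` in favor of `-XX:+UseG1GC`, optionally with `-XX:MaxGCPauseMillis` as a pause-time target. To sanity-check at runtime which collector is active and how much time it spends collecting, the standard GC MXBeans are enough; a quick sketch:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcReport {
    public static void main(String[] args) {
        // Under G1 these beans typically report as
        // "G1 Young Generation" and "G1 Old Generation".
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```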
The Side-Quest: The Slack Notifier "Leak"
While digging through the thread dumps, we found a classic red herring. Our SlackNotifier class, which is responsible for sending failure alerts, was instantiating a new OkHttpClient for every single error notification.
Each OkHttpClient spins up its own connection pool and thread pool. In a high-error scenario, this could have quietly expanded into dozens, if not hundreds, of threads, steadily pushing the JVM toward thread exhaustion and memory pressure.
Interestingly, this wasn’t the direct cause of the production crashes. During the Kafka bursts, no errors were actually being emitted, which meant this code path wasn’t even executed. But left unfixed, it was a ticking time bomb waiting for the next incident.
The fix was straightforward: refactor the notifier to use a single, shared OkHttpClient singleton, ensuring controlled resource usage and predictable behavior.
Not the root cause, but a critical cleanup uncovered at exactly the right time.
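The refactored notifier looks roughly like the sketch below. The class shape, method names, and webhook wiring are illustrative rather than our exact code, and the `RequestBody.create` overload differs slightly between OkHttp 3.x and 4.x:

```java
import java.io.IOException;
import okhttp3.MediaType;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;

public class SlackNotifier {
    // One client for the whole JVM: its connection pool and dispatcher
    // threads are reused instead of being recreated per notification.
    private static final OkHttpClient CLIENT = new OkHttpClient();
    private static final MediaType JSON = MediaType.parse("application/json; charset=utf-8");

    private final String webhookUrl; // illustrative: injected from configuration

    public SlackNotifier(String webhookUrl) {
        this.webhookUrl = webhookUrl;
    }

    public void notifyFailure(String message) throws IOException {
        // Real code should JSON-escape the message properly.
        RequestBody body = RequestBody.create(JSON, "{\"text\":\"" + message + "\"}");
        Request request = new Request.Builder().url(webhookUrl).post(body).build();
        try (Response response = CLIENT.newCall(request).execute()) {
            if (!response.isSuccessful()) {
                throw new IOException("Slack notification failed: HTTP " + response.code());
            }
        }
    }
}
```

This is also the pattern OkHttp's own documentation recommends: share a single client so the connection pool and dispatcher threads are reused across calls.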
Final Fixes
To stabilize Merchant Finance, we didn’t just tweak application code - we re-aligned the infrastructure to match the workload it was actually handling.
Here’s what we changed.
1. Memory Alignment
We set memory.request to match memory.limit at 3 GiB.
This gave the pods Guaranteed QoS for memory: the scheduler reserved the full 3 GiB upfront, so bursts no longer competed with overcommitted neighbours on the node, and the pods were last in line to be evicted or killed when node memory ran tight. [A reddit thread worth exploring]
2. Partition Scaling
We scaled the Kafka topic from 6 partitions to 10.
This change ensured that every merchant service pod became an active consumer, evenly distributing the workload instead of concentrating it on a subset of the fleet. Burst traffic stopped being a stress test for a few pods and became a shared responsibility across all of them.
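The partition count can be bumped in place, either with the kafka-topics CLI or the admin API. A sketch, again with placeholder names; note that for keyed messages a partition increase changes the key-to-partition mapping, so it is worth double-checking any ordering assumptions first:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class ScalePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow the topic to 10 partitions so every pod in the 10-pod
            // consumer group receives an assignment. Partition counts can
            // only be increased, never decreased.
            admin.createPartitions(
                    Map.of("merchant-payouts", NewPartitions.increaseTo(10))) // placeholder topic
                 .all().get();
        }
    }
}
```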
3. The GC Switch
We officially moved this service to G1GC.
For this workload profile, we chose predictable memory behavior over ultra-low pause times. Slightly higher pauses were a small price to pay for stability under burst-heavy traffic.
4. CPU Buffering
We increased the CPU request from 100 millicores to 500 millicores.
This wasn’t about average CPU usage; it was about headroom. During peak allocation and garbage collection cycles, G1GC needs consistent CPU availability to do its job efficiently. An under-provisioned CPU would only delay GC work and amplify memory pressure.
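One related detail worth checking when touching CPU requests and limits: the JVM sizes its default GC thread pools from the number of processors it believes it has, and in a container that number comes from the cgroup CPU settings (the exact behaviour varies by JDK version - newer JDKs honour quotas but ignore CPU shares). A one-liner shows what the JVM inside the pod actually sees:

```java
public class CpuView {
    public static void main(String[] args) {
        // Inside a container this reflects the cgroup CPU configuration,
        // which in turn drives defaults such as G1's parallel GC thread count.
        System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
    }
}
```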
5. Fixing SlackNotifier
Finally, we refactored SlackNotifier to use a single shared OkHttpClient instance instead of creating a new one per notification.
It wasn’t the root cause, but it removed a hidden risk that could have turned a future incident into something far worse.
Results
After applying the infrastructure and JVM changes, we reran the same burst-heavy load tests and monitored JVM, Kubernetes, and Kafka metrics.
The difference wasn’t subtle 👇
1. JVM Behavior (ZGC vs. G1GC)
GC pause frequency dropped significantly
Average GC pause times increased slightly
The change is clearly visible around 12:00–12:10, which is when the GC and infrastructure updates were applied.
2. G1GC Under Load
Switching away from ZGC did increase pause times, but well within acceptable limits.
Average GC pause: ~30–40 ms
Consistency: Flat, predictable, no spikes
Impact: No effect on SLAs
This was an intentional trade-off. We gave up ultra-low pause times in exchange for a lower GC pause count and memory predictability.
3. Kubernetes Pods Stopped Restarting
From a platform perspective, this was the most important outcome.
Pod restarts: 0
OOM events: None
Memory usage: Stable across all pods
CPU usage: Predictable, with sufficient headroom during GC cycles
Container-level metrics after the fixes - memory usage remains stable, CPU has sufficient headroom, and pod restarts drop to zero.
4. Kafka Kept Up With Bursts Without Errors
Finally, we looked at the Kafka side of the system.
Consumer lag: Spiked briefly during bursts, then drained smoothly
Consumer latency (P95): Stable
Errors: None observed
Throughput: Fully sustained during burst windows
Increasing the partition count ensured every pod actively participated in consumption, preventing load concentration and smoothing out processing during spikes.
Kafka consumer metrics during burst traffic. Lag increases temporarily but recovers quickly, with no errors and stable latency.
Final Outcome
After all fixes were applied:
✅ No pod restarts during burst traffic
✅ No OOMKills
✅ Predictable JVM behavior under load
✅ Kafka bursts handled without errors
Resilience isn’t just about handling errors after they happen; it’s about understanding how your system behaves under load.
If you’re running JVM workloads on Kubernetes, it’s worth taking a hard look at both your garbage collector choice and your resource requests versus limits. These decisions don’t show their impact during steady state traffic, but they matter enormously when load arrives in bursts.
Sometimes, the “latest and greatest” option isn’t the right fit for every workload. Stability often comes from choosing the tool that behaves predictably under stress, not the one that looks best on paper.
In production, boring and predictable almost always win.