DEV Community

SoftwareDevs mvpfactory.io
SoftwareDevs mvpfactory.io

Posted on • Originally published at mvpfactory.io

Deterministic Replay Testing for Event-Driven Microservices

---
title: "Deterministic Replay Testing for Kafka Microservices"
published: true
description: "Build a replay test harness that catches breaking schema changes in Kafka consumers before deployment using production stream snapshots and output hashing."
tags: architecture, kotlin, devops, testing
canonical_url: https://blog.mvp-factory.com/deterministic-replay-testing-kafka-microservices
---

## What We're Building

Let me show you a pattern I use in every project that runs Kafka consumers in production: a deterministic replay test harness. By the end of this tutorial, you'll understand how to capture real production Kafka streams, replay them against new consumer versions with schema evolution, and use content-addressable output hashing to catch semantic regressions in CI — before they ship as silent data corruption.

The most dangerous bugs in event-driven systems don't throw exceptions. They're the ones where a consumer silently produces *different* output after a schema change: a renamed field, a changed default, a new enum variant swallowed by a fallback branch. Unit tests won't catch these. Integration tests with synthetic data won't catch these. You need production traffic replay with deterministic output comparison.

## Prerequisites

- Kotlin project with Kafka consumers (Spring Kafka, kafka-clients, or similar)
- Testcontainers set up in your CI environment
- A content-addressable blob store (S3 or GCS) for snapshots
- A Confluent Schema Registry (or compatible) in production

## Step 1: Define the Snapshot Format

Each snapshot is a self-contained replay unit. Here is the minimal setup to get this working:

Enter fullscreen mode Exit fullscreen mode


kotlin
data class TopicSnapshot(
val topicName: String,
val partitionSnapshots: List,
val schemaManifest: Map,
val capturedAt: Instant,
val contentHash: String
)

data class PartitionSnapshot(
val partition: Int,
val records: List,
val startOffset: Long,
val endOffset: Long
)

data class CapturedRecord(
val key: ByteArray,
val value: ByteArray,
val headers: Map,
val timestamp: Long,
val offset: Long
)


The docs don't mention this, but headers and timestamps *must* be preserved. Many consumers branch on header metadata or use event-time semantics. Strip those and your replay tells you nothing.

## Step 2: Build the Replay Engine with Time Control

Determinism means controlling three things: record order, wall-clock time, and external dependencies. Record order is straightforward — replay partition-by-partition at captured offsets. Time is harder. For windowed aggregations, you need a `TimeProvider` abstraction injected into your consumer:

Enter fullscreen mode Exit fullscreen mode


kotlin
interface TimeProvider {
fun now(): Instant
}

class ReplayTimeProvider(
private val records: Iterator
) : TimeProvider {
private var current: Instant = Instant.EPOCH

fun advanceTo(record: CapturedRecord) {
    current = Instant.ofEpochMilli(record.timestamp)
}

override fun now(): Instant = current
Enter fullscreen mode Exit fullscreen mode

}


For windowed aggregations (tumbling, hopping, session windows), the replay engine advances time only when the next record's timestamp crosses a window boundary. This guarantees that window triggers fire identically to production.

External dependencies — database lookups, API calls — get replaced with captured responses stored in the snapshot manifest. A snapshot with externalized dependency responses adds roughly 15–30% storage overhead, but that's a worthwhile trade. External state is the single largest source of replay non-determinism, and you want it gone.

## Step 3: Hash Every Side-Effect

After replay completes, every side-effect the consumer produced gets hashed:

| Output type | Hashing strategy |
|---|---|
| Produced records (to output topics) | SHA-256 of ordered (key, value, headers) tuples |
| Database writes | Deterministic serialization of SQL statements + params |
| HTTP calls (captured) | SHA-256 of (method, path, body) tuples |
| State store snapshots | Sorted key-value iteration hash |

The composite hash becomes the baseline fingerprint for that consumer version against that snapshot. Store it as a CI artifact.

## Step 4: Wire It Into CI as a Hard Gate

Enter fullscreen mode Exit fullscreen mode


yaml
replay-regression-test:
stage: verify
script:
- ./replay-harness run \
--snapshot s3://snapshots/orders-topic/2026-06-10 \
--consumer-image $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA \
--baseline-hash $(cat baseline/orders-consumer.sha256)
rules:
- if: $CI_PIPELINE_SOURCE == "merge_request_event"
allow_failure: false


When the output hash diverges, the harness produces a structured diff showing exactly which records produced different output, which fields changed, and the schema versions involved. Engineers review the diff and either update the baseline (intentional change) or fix the regression.

## Step 5: Handle Schema Evolution During Replay

The replay engine resolves schemas through the mirrored registry at capture time, not the live registry. When testing a new consumer version that expects Schema v3 against a snapshot captured under Schema v2, the harness applies the registry's compatibility rules (BACKWARD, FORWARD, FULL) to verify the evolution is legal, then lets Avro/Protobuf deserialization handle the actual field resolution.

This is where most breaking changes surface. A field marked `optional` in v3 that was `required` in v2 suddenly gets null values during replay that never existed in production — yet. This class of bug is genuinely unnerving because it means your schema registry says "compatible" while your consumer logic says "crash" (or worse, "silently wrong").

## Gotchas

Here's the gotcha that will save you hours:

- **Don't mock Kafka.** Capture real streams and replay them deterministically. Synthetic data misses the edge cases that actually break things in production.
- **Abstract time from the start.** If your consumers call `Instant.now()` or `System.currentTimeMillis()` directly, deterministic replay of windowed aggregations is impossible. Retrofitting it later is painful — it touches more code than you'd expect.
- **Advisory-only replay tests get ignored.** Make baseline hash comparison a deploy gate, not a report. Wire the output hash comparison into your CI pipeline as a hard block on merge, with an explicit approval workflow for intentional baseline updates. If it doesn't block the build, it doesn't exist.
- **Budget 5–10GB per critical topic snapshot and rotate weekly.** Content-addressable snapshots with full header and timestamp metadata are the only reliable replay source.

## Wrapping Up

You now have the architecture for a replay test harness that catches the bugs no other testing approach will find — silent semantic regressions from schema evolution. The core components are the stream snapshotter, schema registry mirror, replay engine with controlled time, output hasher, and baseline comparator. Wire them together, make the CI gate non-negotiable, and you'll stop shipping data corruption disguised as successful deploys.

For deeper reading, check out the [Confluent Schema Registry docs](https://docs.confluent.io/platform/current/schema-registry/index.html) and [Testcontainers for Kafka](https://testcontainers.com/modules/kafka/).
Enter fullscreen mode Exit fullscreen mode

Top comments (0)