Nasit Sony

Posted on May 29

How I Built a KV-Cache Control Plane for LLM Inference — With Real Benchmark Results

#distributedsystems #cpp #mlop #opensource

LLM inference is expensive. The prefill step, which processes the prompt, is a major part of that cost, especially for long prompts. If a node has already processed the same prompt, you shouldn't have to recompute it.

That's the core idea behind KV-cache reuse. But in a distributed system with multiple inference nodes, a new problem emerges: where is the cached prefix stored, and how do you route requests to maximize reuse?

I built llm-serving-cache to answer that question — a metadata-driven control plane for LLM KV-cache placement and routing.

The Problem

In a single-node setup, KV-cache reuse is straightforward. The cache is local and the router is trivial.

In a distributed setup:

- Cached prefixes are scattered across nodes
- The same prompt might be cached on node-a but the request lands on node-b
- Cache misses are expensive: full prefill cost, every time
- GPU memory is finite, so you need admission control and eviction

You need a control plane that knows where every cached prefix lives and routes requests intelligently.

The Architecture

Client Request
      ↓
Router
      ↓
Session Affinity Check   → route to same node if session exists
      ↓
Exact Cache Hit?         → reuse cached result, skip prefill
      ↓
Prefix Match?            → reuse partial computation
      ↓
Cache Miss               → select best node, trigger cache fill
      ↓
[If full] Evict          → remove oldest inactive request
      ↓
Inference + Register     → store new cache entry
      ↓
WAL-backed Metadata Store
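
The decision order in the diagram can be sketched in a few lines. This is a minimal Python illustration, not the project's C++ router: the `Router` and `RouteDecision` names, the token-tuple cache keys, and the `pick_node` callback are all assumptions made for the example, and the linear prefix scan stands in for whatever index the real system uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Sequence, Tuple

Tokens = Tuple[int, ...]


@dataclass
class RouteDecision:
    node: str
    kind: str  # "session", "exact_hit", "prefix_hit", or "miss"
    matched_tokens: int = 0


@dataclass
class Router:
    # session_id -> node that last served this session (hypothetical layout)
    sessions: Dict[str, str] = field(default_factory=dict)
    # cached prompt tokens -> node holding that KV-cache entry
    cache_index: Dict[Tokens, str] = field(default_factory=dict)

    def _longest_cached_prefix(self, tokens: Tokens) -> Optional[Tokens]:
        """Linear scan for the longest cached key that is a proper prefix."""
        best = None
        for key in self.cache_index:
            is_prefix = len(key) < len(tokens) and tokens[:len(key)] == key
            if is_prefix and (best is None or len(key) > len(best)):
                best = key
        return best

    def route(self, session_id: str, prompt_tokens: Sequence[int],
              pick_node: Callable[[], str]) -> RouteDecision:
        tokens = tuple(prompt_tokens)
        # 1. Session affinity: stick to the node that already served the session.
        if session_id in self.sessions:
            return RouteDecision(self.sessions[session_id], "session")
        # 2. Exact hit: the whole prompt is already cached somewhere.
        if tokens in self.cache_index:
            decision = RouteDecision(self.cache_index[tokens], "exact_hit", len(tokens))
        else:
            # 3. Prefix match: reuse the longest cached prefix.
            prefix = self._longest_cached_prefix(tokens)
            if prefix is not None:
                decision = RouteDecision(self.cache_index[prefix], "prefix_hit", len(prefix))
            else:
                # 4. Miss: let the placement policy choose a node.
                decision = RouteDecision(pick_node(), "miss")
            # Register the new entry on the node that will compute it.
            self.cache_index[tokens] = decision.node
        self.sessions[session_id] = decision.node
        return decision


router = Router()
first = router.route("s1", [1, 2, 3], pick_node=lambda: "node-a")
second = router.route("s2", [1, 2, 3], pick_node=lambda: "node-b")
third = router.route("s3", [1, 2, 3, 4], pick_node=lambda: "node-b")
fourth = router.route("s1", [9, 9], pick_node=lambda: "node-b")
print(first.kind, second.kind, third.kind, fourth.kind)
```

In this sketch a session hit returns immediately, so the later checks only run for sessions the router has not seen before.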

Core Components

Router — handles exact hits, prefix matches, session affinity, and cache misses.

Node Registry — tracks available nodes, GPU memory, and utilization.

Metadata Store — persists cache entries and session routes via a WAL-backed KV engine (VeriStore).

Placement Policy — best-fit node selection based on available GPU memory blocks.

Benchmark Results

I ran controlled benchmarks across five cache strategies:

Scenario	Avg Latency	P95 Latency	Hit Rate	Rejection Rate
No Cache	1405 ms	1405 ms	0%	0%
Prefix Reuse	985 ms	1405 ms	50%	0%
Exact Cache	205 ms	205 ms	100%	0%
GPU-Aware	843 ms	1405 ms	25%	25%
GPU-Aware + Eviction	1895 ms	4205 ms	25%	0%

Key observations:

- Exact cache reuse reduces average latency by ~85% vs. no cache (1405 ms to 205 ms)
- Prefix reuse improves average latency but not tail latency: P95 stays at 1405 ms because half the requests still miss
- Eviction reduces rejection (25% to 0%) but increases latency by admitting previously rejected expensive requests
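
To see why a 50% hit rate moves the average but not P95, here is a small Python sketch. It assumes a two-point latency distribution, and the 565 ms hit latency is not a measured value: it is back-calculated from the table so that a 50/50 mix of hits and 1405 ms misses averages 985 ms.

```python
def mean(values):
    return sum(values) / len(values)


def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


MISS_MS = 1405  # from the "No Cache" row
HIT_MS = 565    # assumption: inferred so that the 50% mix averages 985 ms

# 100 requests at a 50% hit rate.
prefix_reuse = [HIT_MS] * 50 + [MISS_MS] * 50

print(mean(prefix_reuse))  # 985.0
print(p95(prefix_reuse))   # 1405
```

As long as more than 5% of requests miss, the 95th percentile lands in the miss group, so the tail does not improve until the hit rate is very high.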

Real Inference Validation (Ollama)

Benchmarks are useful, but I wanted to validate against real inference. I integrated Ollama running Llama 3.1 8B and ran controlled experiments:

Scenario	Total Latency	Prompt Eval	Decode
Cold request	~8,488 ms	177 ms	5,238 ms
Warm request	~5,520 ms	47 ms	5,372 ms
Prefix-related	~5,891 ms	47 ms	5,747 ms

Warm requests cut prompt evaluation from 177 ms to 47 ms, about 73%. But total latency is still ~5.5 seconds because decode dominates: 5,372 ms of the ~5,520 ms warm request, roughly 97%.

This is the key insight: caching helps prefill, but in this setup token generation is the real bottleneck.

GPU Memory Model

GPU memory is modeled as discrete fixed-size blocks (16MB each):

total_blocks = total_vram_mb / block_size
required_blocks = ceil(kv_size_mb / block_size)

Best-fit placement selects the node with the minimum leftover blocks after allocation, reducing fragmentation.
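
A minimal Python sketch of the block arithmetic and best-fit selection described above. The `Node` class and function names are illustrative assumptions, not the project's C++ types; only the 16 MB block size and the two formulas come from the text.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

BLOCK_SIZE_MB = 16  # fixed block size from the memory model


@dataclass
class Node:
    name: str
    total_vram_mb: int
    used_blocks: int = 0

    @property
    def total_blocks(self) -> int:
        return self.total_vram_mb // BLOCK_SIZE_MB

    @property
    def free_blocks(self) -> int:
        return self.total_blocks - self.used_blocks


def required_blocks(kv_size_mb: float) -> int:
    # Round up: a partially filled block still occupies a whole block.
    return math.ceil(kv_size_mb / BLOCK_SIZE_MB)


def best_fit(nodes: List[Node], kv_size_mb: float) -> Optional[Node]:
    """Pick the node with the fewest leftover blocks after allocation."""
    need = required_blocks(kv_size_mb)
    candidates = [n for n in nodes if n.free_blocks >= need]
    if not candidates:
        return None  # no node can hold this entry
    return min(candidates, key=lambda n: n.free_blocks - need)


nodes = [
    Node("node-a", total_vram_mb=1024, used_blocks=60),  # 4 blocks free
    Node("node-b", total_vram_mb=1024, used_blocks=10),  # 54 blocks free
]
small = best_fit(nodes, kv_size_mb=40)    # needs 3 blocks
large = best_fit(nodes, kv_size_mb=100)   # needs 7 blocks
print(small.name, large.name)
```

The 40 MB entry goes to the nearly full node-a, which keeps node-b's large free region intact for the 100 MB entry.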

Under memory pressure:

1. Attempt allocation
2. If insufficient → trigger eviction of oldest inactive request
3. Retry allocation
4. If still insufficient → reject request with explicit reason
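
The four steps can be sketched as follows. This is a Python illustration under stated assumptions: the `NodeMemory` class, the `(blocks, active)` entry layout, and the rejection message are hypothetical, and insertion order is used as a stand-in for request age.

```python
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class NodeMemory:
    total_blocks: int
    # request_id -> (blocks, active); insertion order doubles as age
    entries: "OrderedDict[str, Tuple[int, bool]]" = field(default_factory=OrderedDict)

    def free_blocks(self) -> int:
        return self.total_blocks - sum(blocks for blocks, _ in self.entries.values())

    def evict_oldest_inactive(self) -> bool:
        """Remove the oldest entry that no request is currently using."""
        for request_id, (_, active) in self.entries.items():
            if not active:
                del self.entries[request_id]
                return True  # return immediately; never iterate after deleting
        return False

    def allocate(self, request_id: str, blocks: int) -> Tuple[bool, str]:
        # Steps 1 and 2: on a failed first attempt, evict one inactive entry.
        if self.free_blocks() < blocks:
            self.evict_oldest_inactive()
        # Steps 3 and 4: retry once, then reject with an explicit reason.
        if self.free_blocks() < blocks:
            return False, f"rejected: need {blocks} blocks, {self.free_blocks()} free"
        self.entries[request_id] = (blocks, True)
        return True, "ok"


memory = NodeMemory(total_blocks=4)
memory.entries["old"] = (2, False)   # finished request, safe to evict
memory.entries["busy"] = (2, True)   # still generating, must stay

admitted = memory.allocate("new", blocks=2)   # evicts "old", then fits
rejected = memory.allocate("big", blocks=3)   # nothing inactive left to evict
print(admitted, rejected)
```

This version evicts a single entry before retrying, matching the steps as written; a loop that keeps evicting until the request fits would be a different policy.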

Admission Control Under Load

The most important result from the concurrent benchmark:

Concurrency	Avg Latency	P95 Latency	Throughput
1	5,771 ms	5,771 ms	0.17 req/s
3	10,963 ms	16,299 ms	0.18 req/s
5	16,560 ms	27,744 ms	0.18 req/s
10	29,040 ms	53,525 ms	0.19 req/s

Throughput stays flat while latency explodes. This is classic queuing behavior — the bottleneck is the inference runtime, not the control plane.
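
A simple model reproduces the shape of this table. The Python sketch below assumes, as a simplification not confirmed by the benchmark setup, that all N requests arrive together and the runtime serves them strictly one at a time, each taking the single-request latency of 5.771 s.

```python
def serial_queue_model(service_s: float, n: int):
    """N requests arrive at once and are served one after another."""
    finish_times = [service_s * (i + 1) for i in range(n)]
    avg_latency = sum(finish_times) / n   # equals service_s * (n + 1) / 2
    worst_latency = finish_times[-1]      # equals service_s * n
    throughput = n / worst_latency        # equals 1 / service_s, independent of n
    return avg_latency, worst_latency, throughput


SERVICE_S = 5.771  # single-request latency from the table

for n in (1, 3, 5, 10):
    avg, worst, rps = serial_queue_model(SERVICE_S, n)
    print(f"n={n:2d}  avg={avg:6.2f}s  worst={worst:6.2f}s  throughput={rps:.2f} req/s")
```

Under these assumptions the model predicts about 31.7 s average and 57.7 s worst case at concurrency 10, within roughly 10% of the measured 29.0 s and 53.5 s. Throughput stays fixed at 1 / 5.771 ≈ 0.17 req/s, which is the flat line in the table.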

With admission control (--max-active=3):

	No Control	With Control
Accepted	10	3
P95 Latency	~53.5s	~20.7s

Good systems don't try to serve everyone. They protect latency by rejecting excess load.
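
A cap on in-flight requests is enough to get this behavior. Below is a minimal Python sketch of the idea behind `--max-active`; the `AdmissionController` class and its method names are assumptions for illustration, not the project's implementation.

```python
import threading


class AdmissionController:
    """Reject new requests once max_active requests are in flight."""

    def __init__(self, max_active: int):
        self.max_active = max_active
        self._active = 0
        self._lock = threading.Lock()

    def try_admit(self) -> bool:
        with self._lock:
            if self._active >= self.max_active:
                return False  # reject early instead of queueing
            self._active += 1
            return True

    def release(self) -> None:
        """Call when an admitted request finishes."""
        with self._lock:
            if self._active > 0:
                self._active -= 1


controller = AdmissionController(max_active=3)

# Ten requests arrive at once; none has finished yet.
outcomes = [controller.try_admit() for _ in range(10)]
print(outcomes.count(True), "accepted,", outcomes.count(False), "rejected")

# Once one request completes, a slot opens for the next arrival.
controller.release()
admitted_after_release = controller.try_admit()
```

With ten simultaneous arrivals and a limit of 3, three requests are accepted and seven are rejected, which matches the "Accepted" row above.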

What I Learned

Prefix reuse is valuable but not sufficient. Caching cuts prefill cost (177 ms to 47 ms in the Ollama test), but generation cost dominates real LLM serving. Effective optimization needs to address both.

Single-request latency is misleading. Always benchmark under concurrency. P95 at concurrency=3 was already nearly 3× the single-request time, and at concurrency=10 it was more than 9× (53,525 ms vs. 5,771 ms).

Admission control is more important than caching. A system that accepts everything under load will have terrible tail latency. Reject early, protect your SLA.

WAL-backed metadata is fast. Storage recovery for 5,000 cache entries takes ~20ms, which is negligible next to multi-second inference latency. Persistence is effectively free at this scale.
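
For readers unfamiliar with the pattern, here is a generic Python sketch of write-ahead-log recovery. It is not VeriStore's format or API: the `WalStore` class, the JSON-lines record layout, and the `put`/`delete` operations are assumptions chosen to keep the example short.

```python
import json
import os
import tempfile


class WalStore:
    """Append every mutation to a log first; rebuild state by replaying it."""

    def __init__(self, path: str):
        self.path = path
        self.entries = {}  # cache key -> node name
        self._replay()
        self._log = open(path, "a", encoding="utf-8")

    def _replay(self) -> None:
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final record from a crash: stop replaying
                if record["op"] == "put":
                    self.entries[record["key"]] = record["node"]
                elif record["op"] == "del":
                    self.entries.pop(record["key"], None)

    def _append(self, record: dict) -> None:
        self._log.write(json.dumps(record) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # durable before the in-memory update

    def put(self, key: str, node: str) -> None:
        self._append({"op": "put", "key": key, "node": node})
        self.entries[key] = node

    def delete(self, key: str) -> None:
        self._append({"op": "del", "key": key})
        self.entries.pop(key, None)

    def close(self) -> None:
        self._log.close()


with tempfile.TemporaryDirectory() as workdir:
    wal_path = os.path.join(workdir, "metadata.wal")

    store = WalStore(wal_path)
    store.put("prefix:abc", "node-a")
    store.put("prefix:def", "node-b")
    store.delete("prefix:abc")
    store.close()

    # Simulate a restart: a fresh instance rebuilds state from the log alone.
    reopened = WalStore(wal_path)
    recovered = dict(reopened.entries)
    reopened.close()

print(recovered)
```

Recovery is a single sequential read of the log, which is why replaying a few thousand small records tends to be cheap compared with a multi-second inference call.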

Try It Yourself

git clone --recurse-submodules https://github.com/NasitSony/llm-serving-cache.git
cd llm-serving-cache
cmake -S . -B build
cmake --build build

./build/routing_demo
./build/cache_register_demo

GitHub: https://github.com/NasitSony/llm-serving-cache

This project is the inference serving layer of a larger AI infrastructure stack. The storage layer underneath is VeriStore. The workload orchestration layer above is Veriflow.

If you found this useful, a ⭐ on GitHub goes a long way!

Top comments (2)

Harjot Singh • May 31

A KV-cache control plane with real benchmarks is exactly the kind of unglamorous-but-decisive infra work that determines whether LLM inference is affordable at scale - the KV cache is usually the silent memory hog and the thing that caps your concurrency, so treating it as a managed resource (eviction policy, reuse across requests, prefix sharing) instead of letting the runtime do whatever is where the real throughput/cost wins live. And you put benchmarks on it, which is the part most "I optimized inference" posts skip, so respect for that.

This resonates because it's the same lesson at a different layer: the leverage is in the control plane, not the model. Cheap, fast inference comes from routing, caching, and reuse, not from a smarter model. That's a core piece of how I keep Moonshift affordable - it's a multi-agent pipeline that takes a prompt to a deployed SaaS, and the cost discipline (route each job to the cheapest capable model, cache and reuse aggressively) is what lets a full build land ~$3 flat instead of a runaway token bill. First run's free, no card. Genuinely strong work. Two things I'd love to know: what eviction/reuse policy won in your benchmarks, and did you get prefix-cache sharing across different requests, or is the reuse within-session only? Cross-request prefix reuse is where the big wins usually hide.

Nasit Sony • May 31

Thanks! I completely agree that the control plane is where many of the practical wins live. In my current prototype the largest measured gains came from prefix reuse and admission control under load. The benchmarked system currently supports prefix-aware routing and cache reuse, while cross-request sharing is still relatively simple compared to production systems like vLLM. One area I'm planning to explore next is more sophisticated block-level management and eviction under memory pressure, since that's where the control plane decisions become really interesting. I'd also be interested in hearing how Moonshift handles routing and cache reuse across its multi-agent workflows.