How I Built a KV-Cache Control Plane for LLM Inference — With Real Benchmark Results
LLM inference is expensive. The prefill step (processing the prompt) is a large part of that cost, especially for long prompts. If you've seen the same prompt before, you shouldn't have to recompute it.
That's the core idea behind KV-cache reuse. But in a distributed system with multiple inference nodes, a new problem emerges: where is the cached prefix stored, and how do you route requests to maximize reuse?
I built llm-serving-cache to answer that question — a metadata-driven control plane for LLM KV-cache placement and routing.
The Problem
In a single-node setup, KV-cache reuse is straightforward. The cache is local and the router is trivial.
In a distributed setup:
- Cached prefixes are scattered across nodes
- The same prompt might be cached on node-a but the request lands on node-b
- Cache misses are expensive — full prefill cost, every time
- GPU memory is finite, so you need admission control and eviction
You need a control plane that knows where every cached prefix lives and routes requests intelligently.
The Architecture
Client Request
↓
Router
↓
Session Affinity Check → route to same node if session exists
↓
Exact Cache Hit? → reuse cached result, skip prefill
↓
Prefix Match? → reuse partial computation
↓
Cache Miss → select best node, trigger cache fill
↓
[If full] Evict → remove oldest inactive request
↓
Inference + Register → store new cache entry
↓
WAL-backed Metadata Store
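The decision order in the flow above can be sketched as a single routing function. This is a minimal illustration in Python, not the project's code (the repository builds with CMake). The names `Router`, `Decision`, and `route` are hypothetical, prompts are split on whitespace instead of using a real tokenizer, and the miss path picks the node with the fewest cached entries as a stand-in for the real placement policy. Memory checks and eviction are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    node: str
    kind: str              # "session", "exact", "prefix", or "miss"
    reused_tokens: int = 0


@dataclass
class Router:
    nodes: list[str]
    sessions: dict[str, str] = field(default_factory=dict)           # session id -> node
    cache: dict[tuple[str, ...], str] = field(default_factory=dict)  # cached tokens -> node

    def route(self, session_id: str, prompt: str) -> Decision:
        tokens = tuple(prompt.split())

        # 1. Session affinity: an existing session stays on its node.
        if session_id in self.sessions:
            return Decision(self.sessions[session_id], "session")

        if tokens in self.cache:
            # 2. Exact hit: the whole prompt is already cached somewhere.
            decision = Decision(self.cache[tokens], "exact", len(tokens))
        else:
            # 3. Prefix match: pick the longest cached prefix of this prompt.
            best = max(
                (p for p in self.cache if p and tokens[: len(p)] == p),
                key=len,
                default=None,
            )
            if best is not None:
                decision = Decision(self.cache[best], "prefix", len(best))
            else:
                # 4. Miss: placeholder placement (node with fewest cached entries).
                node = min(
                    self.nodes,
                    key=lambda n: sum(1 for v in self.cache.values() if v == n),
                )
                decision = Decision(node, "miss")

        # Register the new cache entry and the session route.
        self.cache[tokens] = decision.node
        self.sessions[session_id] = decision.node
        return decision
```

Keeping the checks in this order means the cheapest lookups run first, and placement only runs when nothing can be reused.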
Core Components
Router — handles exact hits, prefix matches, session affinity, and cache misses.
Node Registry — tracks available nodes, GPU memory, and utilization.
Metadata Store — persists cache entries and session routes via a WAL-backed KV engine (VeriStore).
Placement Policy — best-fit node selection based on available GPU memory blocks.
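To make the "WAL-backed" part concrete, here is a generic write-ahead-log pattern in Python. It is not VeriStore's format or API, neither of which is shown in this post. The class `WalStore`, the JSON-lines record layout, and the key names are all hypothetical. Every change is appended and synced to the log before the in-memory map is updated, and recovery is a replay of the log.

```python
import json
import os
import tempfile


class WalStore:
    """Toy WAL-backed key-value store: log first, then update memory."""

    def __init__(self, path: str):
        self.path = path
        self.data: dict[str, str] = {}
        self._replay()
        self._log = open(path, "a", encoding="utf-8")

    def _replay(self) -> None:
        """Rebuild in-memory state by re-applying every logged record."""
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    break  # stop at a torn final record
                if rec["op"] == "put":
                    self.data[rec["key"]] = rec["value"]
                elif rec["op"] == "del":
                    self.data.pop(rec["key"], None)

    def _append(self, rec: dict) -> None:
        self._log.write(json.dumps(rec) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # make the record durable before applying it

    def put(self, key: str, value: str) -> None:
        self._append({"op": "put", "key": key, "value": value})
        self.data[key] = value

    def delete(self, key: str) -> None:
        self._append({"op": "del", "key": key})
        self.data.pop(key, None)

    def close(self) -> None:
        self._log.close()


def demo() -> dict:
    path = os.path.join(tempfile.mkdtemp(), "metadata.wal")
    store = WalStore(path)
    store.put("prefix:abc123", "node-a")
    store.put("session:s1", "node-a")
    store.delete("session:s1")
    store.close()
    # Simulate a restart: a fresh instance rebuilds its state from the log.
    recovered = WalStore(path)
    state = dict(recovered.data)
    recovered.close()
    return state
```

A real engine would also compact or checkpoint the log so that replay time stays bounded. This sketch skips that.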
Benchmark Results
I ran controlled benchmarks across five cache strategies:
| Scenario | Avg Latency | P95 Latency | Hit Rate | Rejection Rate |
|---|---|---|---|---|
| No Cache | 1405 ms | 1405 ms | 0% | 0% |
| Prefix Reuse | 985 ms | 1405 ms | 50% | 0% |
| Exact Cache | 205 ms | 205 ms | 100% | 0% |
| GPU-Aware | 843 ms | 1405 ms | 25% | 25% |
| GPU-Aware + Eviction | 1895 ms | 4205 ms | 25% | 0% |
Key observations:
- Exact cache reuse reduces latency by ~85% vs no cache
- Prefix reuse improves average latency but not tail latency — P95 stays high when misses are still present
- Eviction reduces rejection but increases latency by admitting previously rejected expensive requests
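The arithmetic behind the first two observations can be checked directly from the table. The sketch below assumes every miss costs 1405 ms and every prefix hit costs the same amount. The 565 ms prefix-hit latency is derived from the reported average and hit rate; it is not a reported measurement. The 100-request sample is hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def p95(xs):
    """Nearest-rank 95th percentile."""
    ordered = sorted(xs)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


MISS_MS = 1405      # "No Cache" row
EXACT_HIT_MS = 205  # "Exact Cache" row

# Implied prefix-hit latency: solve 0.5 * hit + 0.5 * 1405 = 985.
PREFIX_HIT_MS = 2 * 985 - MISS_MS  # derived value, not reported

# 50% hit rate over a hypothetical 100 requests.
latencies = [PREFIX_HIT_MS] * 50 + [MISS_MS] * 50

exact_reduction = 1 - EXACT_HIT_MS / MISS_MS

print(f"prefix hit (implied): {PREFIX_HIT_MS} ms")
print(f"mean: {mean(latencies):.0f} ms, p95: {p95(latencies)} ms")
print(f"exact-hit reduction: {exact_reduction:.1%}")
```

Because far more than 5% of requests still miss, the 95th percentile lands on a miss, which is why the average improves while P95 stays at 1405 ms.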
Real Inference Validation (Ollama)
Benchmarks are useful, but I wanted to validate against real inference. I integrated Ollama running Llama 3.1 8B and ran controlled experiments:
| Scenario | Total Latency | Prompt Eval | Decode |
|---|---|---|---|
| Cold request | ~8,488 ms | 177 ms | 5,238 ms |
| Warm request | ~5,520 ms | 47 ms | 5,372 ms |
| Prefix-related | ~5,891 ms | 47 ms | 5,747 ms |
Warm requests drop prompt evaluation from 177ms → 47ms. But total latency is still ~5.5 seconds because decode dominates.
This is the key insight: caching helps prefill, but in these runs token generation is the real bottleneck.
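One way to get this kind of breakdown is from the timing fields Ollama returns with a generate response: `total_duration`, `prompt_eval_duration`, and `eval_duration`, all in nanoseconds. The post does not show how its numbers were collected, so treat the sketch below as an assumption about the method. The helper names `breakdown_ms` and `row` are hypothetical, and the inputs are the table rows above converted to nanoseconds, not fresh measurements.

```python
NS_PER_MS = 1_000_000


def breakdown_ms(resp: dict) -> dict:
    """Split a generate response into prefill, decode, and everything else (ms)."""
    total = resp["total_duration"] / NS_PER_MS
    prefill = resp["prompt_eval_duration"] / NS_PER_MS
    decode = resp["eval_duration"] / NS_PER_MS
    return {
        "total": total,
        "prefill": prefill,
        "decode": decode,
        "other": total - prefill - decode,
        "decode_share": decode / total,
    }


def row(total_ms: int, prefill_ms: int, decode_ms: int) -> dict:
    """Build a response-shaped dict from one table row (values in ms)."""
    return {
        "total_duration": total_ms * NS_PER_MS,
        "prompt_eval_duration": prefill_ms * NS_PER_MS,
        "eval_duration": decode_ms * NS_PER_MS,
    }


cold = breakdown_ms(row(8488, 177, 5238))
warm = breakdown_ms(row(5520, 47, 5372))

print(f"cold: other = {cold['other']:.0f} ms")
print(f"warm: other = {warm['other']:.0f} ms, decode share = {warm['decode_share']:.0%}")
```

On the warm row, decode accounts for about 97% of total latency. The cold row leaves roughly 3,073 ms outside prompt evaluation and decode; the post does not break that portion down.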
GPU Memory Model
GPU memory is modeled as discrete fixed-size blocks (16MB each):
total_blocks = total_vram_mb / block_size
required_blocks = ceil(kv_size_mb / block_size)
Best-fit placement selects the node with the minimum leftover blocks after allocation, reducing fragmentation.
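A minimal sketch of that rule in Python, using the 16 MB block size from above. The function names and the node names are hypothetical, and rounding `total_blocks` down to a whole number of blocks is an assumption (the formula above writes a plain division).

```python
import math

BLOCK_MB = 16  # block size from the text


def blocks_total(total_vram_mb: int) -> int:
    """Whole blocks that fit in a node's VRAM (rounded down; an assumption)."""
    return total_vram_mb // BLOCK_MB


def blocks_required(kv_size_mb: float) -> int:
    """Blocks needed for a KV cache of the given size (rounded up)."""
    return math.ceil(kv_size_mb / BLOCK_MB)


def best_fit(free_blocks: dict[str, int], kv_size_mb: float):
    """Return the node with the fewest leftover blocks, or None if none fits."""
    need = blocks_required(kv_size_mb)
    leftover = {
        node: free - need for node, free in free_blocks.items() if free >= need
    }
    if not leftover:
        return None
    return min(leftover, key=leftover.get)


free = {"node-a": 100, "node-b": 12, "node-c": 40}
print(best_fit(free, 150))   # needs 10 blocks; node-b leaves only 2 free
print(best_fit(free, 700))   # needs 44 blocks; only node-a can hold it
print(best_fit(free, 2000))  # needs 125 blocks; nothing fits
```

Choosing the tightest fit keeps the larger free regions on other nodes intact for requests that need them.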
Under memory pressure:
1. Attempt allocation
2. If insufficient → trigger eviction of oldest inactive request
3. Retry allocation
4. If still insufficient → reject request with explicit reason
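The four steps above can be sketched for a single node as follows. The `Node` and `Entry` classes are hypothetical. Two details are assumptions beyond what the post states: the sketch keeps evicting the oldest inactive entries until the request fits, and it rejects up front when even evicting every inactive entry would not free enough blocks, so no cache is destroyed for a request that will be rejected anyway.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    blocks: int
    active: bool = False


@dataclass
class Node:
    total_blocks: int
    # Insertion-ordered, so the first inactive entry found is the oldest one.
    entries: dict[str, Entry] = field(default_factory=dict)

    @property
    def free_blocks(self) -> int:
        return self.total_blocks - sum(e.blocks for e in self.entries.values())

    def _evict_oldest_inactive(self) -> None:
        for key, entry in self.entries.items():
            if not entry.active:
                del self.entries[key]
                return  # stop iterating immediately after the delete

    def allocate(self, key: str, blocks: int) -> tuple[bool, str]:
        evictable = sum(e.blocks for e in self.entries.values() if not e.active)
        if blocks > self.free_blocks + evictable:
            # Reject with an explicit reason, without evicting anything.
            return False, (
                f"rejected: need {blocks} blocks, "
                f"{self.free_blocks} free, {evictable} evictable"
            )
        while self.free_blocks < blocks:
            self._evict_oldest_inactive()  # evict, then retry the check
        self.entries[key] = Entry(blocks, active=True)
        return True, "ok"
```

For example, on a 10-block node holding two 4-block entries, marking the older one inactive lets a 5-block request succeed by evicting it, while a later 3-block request is rejected because the remaining entries are all active.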
Admission Control Under Load
The most important result from the concurrent benchmark:
| Concurrency | Avg Latency | P95 Latency | Throughput |
|---|---|---|---|
| 1 | 5,771 ms | 5,771 ms | 0.17 req/s |
| 3 | 10,963 ms | 16,299 ms | 0.18 req/s |
| 5 | 16,560 ms | 27,744 ms | 0.18 req/s |
| 10 | 29,040 ms | 53,525 ms | 0.19 req/s |
Throughput stays flat while latency explodes. This is classic queuing behavior — the bottleneck is the inference runtime, not the control plane.
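A back-of-the-envelope model shows why. Assume, as a simplification the post does not confirm, that the runtime serves requests strictly one at a time and that all N requests arrive together. Each request then waits for everything ahead of it. The function names below are hypothetical; the 5.771 s service time is the single-request latency from the table.

```python
SERVICE_S = 5.771  # single-request latency from the table, in seconds


def serial_latencies(n: int, service_s: float) -> list[float]:
    """Latency of each of n simultaneous requests served one at a time."""
    return [service_s * (i + 1) for i in range(n)]


def summarize(n: int, service_s: float = SERVICE_S) -> dict:
    lat = serial_latencies(n, service_s)
    return {
        "avg_s": sum(lat) / n,
        "max_s": lat[-1],
        "throughput_rps": n / lat[-1],
    }


for n in (1, 3, 5, 10):
    s = summarize(n)
    print(
        f"n={n:2d}  avg={s['avg_s']:5.1f}s  max={s['max_s']:5.1f}s  "
        f"throughput={s['throughput_rps']:.2f} req/s"
    )
```

For n = 10 the model gives an average of about 31.7 s and a worst case of about 57.7 s, against the measured 29.0 s average and 53.5 s P95. Modeled throughput stays near 0.17 req/s at every concurrency level, which matches the flat throughput column.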
With admission control (--max-active=3):
| | No Control | With Control |
|---|---|---|
| Accepted | 10 | 3 |
| P95 Latency | ~53.5s | ~20.7s |
Good systems don't try to serve everyone. They protect latency by rejecting excess load.
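The `--max-active` flag comes from the benchmark above, but the post does not show how the limit is enforced. One simple way to do it is a counter guarded by a lock, sketched here in Python. The class `AdmissionController` and its method names are hypothetical.

```python
import threading


class AdmissionController:
    """Admit a request only while fewer than max_active are in flight."""

    def __init__(self, max_active: int):
        self.max_active = max_active
        self._active = 0
        self._lock = threading.Lock()

    def try_admit(self) -> bool:
        """Return True and take a slot, or False if the limit is reached."""
        with self._lock:
            if self._active >= self.max_active:
                return False
            self._active += 1
            return True

    def release(self) -> None:
        """Free a slot when a request finishes."""
        with self._lock:
            if self._active > 0:
                self._active -= 1


controller = AdmissionController(max_active=3)
results = [controller.try_admit() for _ in range(10)]
print(f"accepted={results.count(True)} rejected={results.count(False)}")
```

With ten simultaneous arrivals and a limit of three, three requests are accepted and seven are rejected immediately, so the rejected callers can retry or fail over without sitting in a queue.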
What I Learned
Prefix reuse is valuable but not sufficient. Caching cuts prefill cost (prompt evaluation dropped from 177 ms to 47 ms in the Ollama runs), but generation cost dominated total latency there. Effective optimization needs to address both.
Single-request latency is misleading. Always benchmark under concurrency. P95 at concurrency=3 was already nearly 3× the single-request time (16.3 s vs 5.8 s), and at concurrency=10 it was more than 9× (53.5 s).
Admission control is more important than caching. A system that accepts everything under load will have terrible tail latency. Reject early, protect your SLA.
WAL-backed metadata is fast. Storage recovery for 5,000 cache entries takes ~20 ms, which is negligible next to multi-second inference latency. Persistence is effectively free at this scale.
Try It Yourself
git clone --recurse-submodules https://github.com/NasitSony/llm-serving-cache.git
cd llm-serving-cache
cmake -S . -B build
cmake --build build
./build/routing_demo
./build/cache_register_demo
GitHub: https://github.com/NasitSony/llm-serving-cache
This project is the inference serving layer of a larger AI infrastructure stack. The storage layer underneath is VeriStore. The workload orchestration layer above is Veriflow.
If you found this useful, a ⭐ on GitHub goes a long way!
Top comments (2)
A KV-cache control plane with real benchmarks is exactly the kind of unglamorous-but-decisive infra work that determines whether LLM inference is affordable at scale - the KV cache is usually the silent memory hog and the thing that caps your concurrency, so treating it as a managed resource (eviction policy, reuse across requests, prefix sharing) instead of letting the runtime do whatever is where the real throughput/cost wins live. And you put benchmarks on it, which is the part most "I optimized inference" posts skip, so respect for that.
This resonates because it's the same lesson at a different layer: the leverage is in the control plane, not the model. Cheap, fast inference comes from routing, caching, and reuse, not from a smarter model. That's a core piece of how I keep Moonshift affordable - it's a multi-agent pipeline that takes a prompt to a deployed SaaS, and the cost discipline (route each job to the cheapest capable model, cache and reuse aggressively) is what lets a full build land ~$3 flat instead of a runaway token bill. First run's free, no card. Genuinely strong work. Two things I'd love to know: what eviction/reuse policy won in your benchmarks, and did you get prefix-cache sharing across different requests, or is the reuse within-session only? Cross-request prefix reuse is where the big wins usually hide.
Thanks! I completely agree that the control plane is where many of the practical wins live. In my current prototype the largest measured gains came from prefix reuse and admission control under load. The benchmarked system currently supports prefix-aware routing and cache reuse, while cross-request sharing is still relatively simple compared to production systems like vLLM. One area I'm planning to explore next is more sophisticated block-level management and eviction under memory pressure, since that's where the control plane decisions become really interesting. I'd also be interested in hearing how Moonshift handles routing and cache reuse across its multi-agent workflows.