DEV Community

Nasit Sony

How I Built a KV-Cache Control Plane for LLM Inference — With Real Benchmark Results

LLM inference is expensive. The prefill step (processing the prompt) is a large part of that cost for long prompts, and it is wasted work when the prompt has been seen before. If you've seen the same prompt before, you shouldn't have to recompute it.

That's the core idea behind KV-cache reuse. But in a distributed system with multiple inference nodes, a new problem emerges: where is the cached prefix stored, and how do you route requests to maximize reuse?

I built llm-serving-cache to answer that question — a metadata-driven control plane for LLM KV-cache placement and routing.


The Problem

In a single-node setup, KV-cache reuse is straightforward. The cache is local and the router is trivial.

In a distributed setup:

  • Cached prefixes are scattered across nodes
  • The same prompt might be cached on node-a but the request lands on node-b
  • Cache misses are expensive — full prefill cost, every time
  • GPU memory is finite, so you need admission control and eviction

You need a control plane that knows where every cached prefix lives and routes requests intelligently.

The Architecture

Client Request
      ↓
Router
      ↓
Session Affinity Check   → route to same node if session exists
      ↓
Exact Cache Hit?         → reuse cached result, skip prefill
      ↓
Prefix Match?            → reuse partial computation
      ↓
Cache Miss               → select best node, trigger cache fill
      ↓
[If full] Evict          → remove oldest inactive request
      ↓
Inference + Register     → store new cache entry
      ↓
WAL-backed Metadata Store
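The decision order in the diagram can be sketched as follows. This is illustrative Python, not code from llm-serving-cache: the `Router` and `Decision` classes and the two in-memory maps are hypothetical stand-ins for the router and metadata store, and prompts are assumed to arrive as token sequences.

```python
from dataclasses import dataclass, field
from typing import Optional, Sequence


@dataclass
class Decision:
    kind: str              # "session", "exact", "prefix", or "miss"
    node: Optional[str]    # None on a miss: placement picks the node afterwards
    reused_tokens: int = 0


@dataclass
class Router:
    # Hypothetical in-memory stand-ins for the metadata store.
    sessions: dict = field(default_factory=dict)  # session_id -> node
    cache: dict = field(default_factory=dict)     # token tuple -> node

    def route(self, session_id: str, tokens: Sequence[int]) -> Decision:
        key = tuple(tokens)
        # 1. Session affinity: keep an existing session on its node.
        if session_id in self.sessions:
            return Decision("session", self.sessions[session_id])
        # 2. Exact hit: the whole prompt is already cached somewhere.
        if key in self.cache:
            return Decision("exact", self.cache[key], len(key))
        # 3. Prefix match: longest cached entry that is a proper prefix.
        for n in range(len(key) - 1, 0, -1):
            node = self.cache.get(key[:n])
            if node is not None:
                return Decision("prefix", node, n)
        # 4. Miss: nothing reusable; placement and cache fill happen next.
        return Decision("miss", None)
```

The prefix search here does one dictionary lookup per candidate length, which is fine for a sketch; a trie or block-hash index would avoid rehashing long prompts.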

Core Components

Router — handles exact hits, prefix matches, session affinity, and cache misses.

Node Registry — tracks available nodes, GPU memory, and utilization.

Metadata Store — persists cache entries and session routes via a WAL-backed KV engine (VeriStore).

Placement Policy — best-fit node selection based on available GPU memory blocks.


Benchmark Results

I ran controlled benchmarks across five cache strategies:

| Scenario | Avg Latency | P95 Latency | Hit Rate | Rejection Rate |
| --- | --- | --- | --- | --- |
| No Cache | 1405 ms | 1405 ms | 0% | 0% |
| Prefix Reuse | 985 ms | 1405 ms | 50% | 0% |
| Exact Cache | 205 ms | 205 ms | 100% | 0% |
| GPU-Aware | 843 ms | 1405 ms | 25% | 25% |
| GPU-Aware + Eviction | 1895 ms | 4205 ms | 25% | 0% |

Key observations:

  • Exact cache reuse reduces latency by ~85% vs no cache
  • Prefix reuse improves average latency but not tail latency: P95 stays high while misses are still present
  • Eviction reduces rejections but increases latency, because it admits expensive requests that would otherwise have been rejected
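The first two observations can be checked with a little arithmetic. The sketch below is illustrative Python, not code from the project. The 565 ms hit latency is a hypothetical value, chosen so that a 50/50 mix of hits and misses averages 985 ms; the table does not report per-request latencies.

```python
def mean(xs):
    return sum(xs) / len(xs)


def percentile_nearest_rank(xs, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(xs)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100), 1-based
    return ordered[rank - 1]


NO_CACHE_MS = 1405     # from the table
EXACT_CACHE_MS = 205   # from the table
HIT_MS = 565           # hypothetical: back-derived, not a measured value

# Exact cache vs no cache: relative latency reduction.
reduction = (NO_CACHE_MS - EXACT_CACHE_MS) / NO_CACHE_MS  # about 0.854

# Prefix reuse at a 50% hit rate: half the requests still pay full prefill.
latencies = [HIT_MS, NO_CACHE_MS] * 10
avg_ms = mean(latencies)                          # 985.0
p95_ms = percentile_nearest_rank(latencies, 95)   # 1405
```

As long as more than 5% of requests miss, the 95th percentile lands on a miss, so the tail stays at the no-cache latency however cheap the hits are.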

Real Inference Validation (Ollama)

Benchmarks are useful, but I wanted to validate against real inference. I integrated Ollama running Llama 3.1 8B and ran controlled experiments:

| Scenario | Total Latency | Prompt Eval | Decode |
| --- | --- | --- | --- |
| Cold request | ~8,488 ms | 177 ms | 5,238 ms |
| Warm request | ~5,520 ms | 47 ms | 5,372 ms |
| Prefix-related | ~5,891 ms | 47 ms | 5,747 ms |

Warm requests drop prompt evaluation from 177 ms to 47 ms. But total latency is still ~5.5 seconds because decode dominates.

This is the key insight: caching helps prefill, but in these runs token generation is the real bottleneck.
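To put a number on "decode dominates", the sketch below computes the share of prompt evaluation in the prompt-eval-plus-decode time, using only the figures from the table. It is illustrative Python, and the dictionary layout is an assumption. The cold total of ~8,488 ms is larger than prompt eval plus decode; the table does not break down the remainder, so it is left out here.

```python
# Values copied from the table above (milliseconds).
cold = {"prompt_eval_ms": 177, "decode_ms": 5238}
warm = {"prompt_eval_ms": 47, "decode_ms": 5372}


def prefill_share(run):
    """Fraction of (prompt eval + decode) time spent on prompt evaluation."""
    return run["prompt_eval_ms"] / (run["prompt_eval_ms"] + run["decode_ms"])


saved_ms = cold["prompt_eval_ms"] - warm["prompt_eval_ms"]  # 130 ms
cold_share = prefill_share(cold)  # about 0.033
warm_share = prefill_share(warm)  # about 0.009
```

Even a perfect prefill cache could remove at most the 177 ms of prompt evaluation, roughly 3% of the measured compute time for these short prompts. Longer prompts would shift that ratio toward prefill.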


GPU Memory Model

GPU memory is modeled as discrete fixed-size blocks (16MB each):

total_blocks = total_vram_mb / block_size
required_blocks = ceil(kv_size_mb / block_size)

Best-fit placement selects the node with the minimum leftover blocks after allocation, reducing fragmentation.
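A minimal sketch of the block arithmetic and best-fit choice, in illustrative Python rather than the project's own code. The `free_blocks` map of node name to free block count is a hypothetical stand-in for the node registry, and ties are broken by node name purely for determinism.

```python
import math
from typing import Optional

BLOCK_MB = 16  # block size stated in the post


def blocks_needed(kv_size_mb: float) -> int:
    """required_blocks = ceil(kv_size_mb / block_size)."""
    return math.ceil(kv_size_mb / BLOCK_MB)


def best_fit(free_blocks: dict, kv_size_mb: float) -> Optional[str]:
    """Pick the node with the fewest leftover blocks after allocation.

    Returns None when no node can hold the request.
    """
    need = blocks_needed(kv_size_mb)
    candidates = [
        (free - need, name)
        for name, free in free_blocks.items()
        if free >= need
    ]
    if not candidates:
        return None
    return min(candidates)[1]  # smallest leftover wins; name breaks ties


free = {"node-a": 64, "node-b": 10, "node-c": 8}
print(best_fit(free, 100))  # 100 MB needs 7 blocks; node-c leaves 1 -> node-c
```

Choosing the tightest fit keeps large contiguous capacity on node-a available for later, larger requests.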

Under memory pressure:

  1. Attempt allocation
  2. If insufficient → trigger eviction of oldest inactive request
  3. Retry allocation
  4. If still insufficient → reject request with explicit reason
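The four steps above can be sketched as a single allocation routine. This is illustrative Python under stated assumptions: the `NodeMemory` class, its method names, and the rejection message are hypothetical, and the sketch evicts at most one entry per attempt, following the steps literally.

```python
from collections import OrderedDict


class NodeMemory:
    """Per-node block accounting; entries are kept oldest first."""

    def __init__(self, total_blocks: int):
        self.total_blocks = total_blocks
        self.entries = OrderedDict()  # request_id -> (blocks, active)

    def free_blocks(self) -> int:
        return self.total_blocks - sum(b for b, _ in self.entries.values())

    def finish(self, request_id: str) -> None:
        """Mark a request inactive; its blocks stay cached until evicted."""
        blocks, _ = self.entries[request_id]
        self.entries[request_id] = (blocks, False)  # keeps insertion order

    def _evict_oldest_inactive(self) -> bool:
        victim = next(
            (rid for rid, (_, active) in self.entries.items() if not active),
            None,
        )
        if victim is None:
            return False
        del self.entries[victim]
        return True

    def allocate(self, request_id: str, blocks: int):
        if self.free_blocks() < blocks:    # step 1: first attempt fails
            self._evict_oldest_inactive()  # step 2: evict oldest inactive
        if self.free_blocks() < blocks:    # step 3: retry
            # step 4: reject with an explicit reason
            return False, "insufficient GPU blocks after eviction"
        self.entries[request_id] = (blocks, True)
        return True, "ok"
```

A production policy would probably loop over several victims and weigh recency against reuse; one eviction per attempt keeps the example close to the list.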

Admission Control Under Load

The most important result from the concurrent benchmark:

| Concurrency | Avg Latency | P95 Latency | Throughput |
| --- | --- | --- | --- |
| 1 | 5,771 ms | 5,771 ms | 0.17 req/s |
| 3 | 10,963 ms | 16,299 ms | 0.18 req/s |
| 5 | 16,560 ms | 27,744 ms | 0.18 req/s |
| 10 | 29,040 ms | 53,525 ms | 0.19 req/s |

Throughput stays flat while latency explodes. This is classic queuing behavior — the bottleneck is the inference runtime, not the control plane.
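One simple model reproduces the shape of these numbers, though not the exact values: assume all requests arrive at once and the runtime serves them one at a time in FIFO order. That serialization is an assumption made for illustration; the post does not describe how the runtime schedules concurrent requests. The sketch is illustrative Python.

```python
def fifo_burst(n: int, service_s: float):
    """Latencies when n requests arrive together at a single FIFO server.

    Request i (1-based) waits for the i - 1 requests ahead of it, so it
    completes at i * service_s.
    """
    latencies = [service_s * (i + 1) for i in range(n)]
    avg = sum(latencies) / n
    worst = latencies[-1]
    throughput = n / worst  # equals 1 / service_s, independent of n
    return avg, worst, throughput


SERVICE_S = 5.771  # single-request latency from the table, in seconds

for n in (1, 3, 5, 10):
    avg, worst, tput = fifo_burst(n, SERVICE_S)
    print(f"n={n:2d} avg={avg:6.2f}s worst={worst:6.2f}s tput={tput:.3f} req/s")
```

Under this model, average latency grows as (n + 1) / 2 service times while throughput is pinned at one request per service time. At n = 10 that gives an average near 31.7 s, against the measured 29.0 s.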

With admission control (--max-active=3) at concurrency 10:

| Metric | No Control | With Control |
| --- | --- | --- |
| Accepted | 10 | 3 |
| P95 Latency | ~53.5s | ~20.7s |

Good systems don't try to serve everyone. They protect latency by rejecting excess load.
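A cap on in-flight requests is enough to get this behavior. The sketch below is illustrative Python, not the project's implementation of `--max-active`: the `AdmissionController` class and its method names are hypothetical.

```python
import threading


class AdmissionController:
    """Reject new work once max_active requests are in flight."""

    def __init__(self, max_active: int):
        self.max_active = max_active
        self._active = 0
        self._lock = threading.Lock()  # try_admit may be called concurrently

    def try_admit(self) -> bool:
        with self._lock:
            if self._active >= self.max_active:
                return False  # caller should reject immediately, not queue
            self._active += 1
            return True

    def release(self) -> None:
        """Call when an admitted request finishes, successfully or not."""
        with self._lock:
            self._active = max(0, self._active - 1)


controller = AdmissionController(max_active=3)
accepted = sum(controller.try_admit() for _ in range(10))
print(accepted)  # 3 of 10 simultaneous requests are admitted
```

Rejecting at the door, rather than queueing, is what bounds the tail: an admitted request waits behind at most two others.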


What I Learned

Prefix reuse is valuable but not sufficient. Caching cuts prefill cost (prompt evaluation fell from 177 ms to 47 ms in the Ollama runs), but generation cost dominates real LLM serving. Effective optimization needs to address both.

Single-request latency is misleading. Always benchmark under concurrency. P95 at concurrency=10 was more than 9× the single-request time (53,525 ms vs 5,771 ms).

Admission control is more important than caching. A system that accepts everything under load will have terrible tail latency. Reject early, protect your SLA.

WAL-backed metadata is fast. Storage recovery for 5,000 cache entries takes ~20ms — completely invisible compared to inference latency. Persistence is free at this scale.
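For readers unfamiliar with the pattern, here is a minimal write-ahead-log sketch: every change is appended and synced to a log before the in-memory map is updated, and recovery replays the log from the start. It is illustrative Python and does not reflect VeriStore's actual format or API; the `WalStore` class and the JSON-lines record layout are assumptions.

```python
import json
import os


class WalStore:
    """Tiny key-value store that recovers its state by replaying a log."""

    def __init__(self, path: str):
        self.path = path
        self.entries = {}
        if os.path.exists(path):
            self._replay()
        self._log = open(path, "a", encoding="utf-8")

    def _replay(self) -> None:
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final write: stop at the last good record
                if rec["op"] == "put":
                    self.entries[rec["key"]] = rec["value"]
                else:
                    self.entries.pop(rec["key"], None)

    def _append(self, rec: dict) -> None:
        self._log.write(json.dumps(rec) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # durable before we acknowledge

    def put(self, key: str, value: str) -> None:
        self._append({"op": "put", "key": key, "value": value})
        self.entries[key] = value

    def delete(self, key: str) -> None:
        self._append({"op": "delete", "key": key})
        self.entries.pop(key, None)

    def close(self) -> None:
        self._log.close()
```

Recovery cost is one sequential read of the log, which is why replaying a few thousand small records is cheap next to a multi-second inference call.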


Try It Yourself

git clone --recurse-submodules https://github.com/NasitSony/llm-serving-cache.git
cd llm-serving-cache
cmake -S . -B build
cmake --build build

./build/routing_demo
./build/cache_register_demo

GitHub: https://github.com/NasitSony/llm-serving-cache


This project is the inference serving layer of a larger AI infrastructure stack. The storage layer underneath is VeriStore. The workload orchestration layer above is Veriflow.

If you found this useful, a ⭐ on GitHub goes a long way!

Top comments (2)

Harjot Singh

A KV-cache control plane with real benchmarks is exactly the kind of unglamorous-but-decisive infra work that determines whether LLM inference is affordable at scale - the KV cache is usually the silent memory hog and the thing that caps your concurrency, so treating it as a managed resource (eviction policy, reuse across requests, prefix sharing) instead of letting the runtime do whatever is where the real throughput/cost wins live. And you put benchmarks on it, which is the part most "I optimized inference" posts skip, so respect for that.

This resonates because it's the same lesson at a different layer: the leverage is in the control plane, not the model. Cheap, fast inference comes from routing, caching, and reuse, not from a smarter model. That's a core piece of how I keep Moonshift affordable - it's a multi-agent pipeline that takes a prompt to a deployed SaaS, and the cost discipline (route each job to the cheapest capable model, cache and reuse aggressively) is what lets a full build land ~$3 flat instead of a runaway token bill. First run's free, no card. Genuinely strong work. Two things I'd love to know: what eviction/reuse policy won in your benchmarks, and did you get prefix-cache sharing across different requests, or is the reuse within-session only? Cross-request prefix reuse is where the big wins usually hide.

Nasit Sony

Thanks! I completely agree that the control plane is where many of the practical wins live. In my current prototype the largest measured gains came from prefix reuse and admission control under load. The benchmarked system currently supports prefix-aware routing and cache reuse, while cross-request sharing is still relatively simple compared to production systems like vLLM. One area I'm planning to explore next is more sophisticated block-level management and eviction under memory pressure, since that's where the control plane decisions become really interesting. I'd also be interested in hearing how Moonshift handles routing and cache reuse across its multi-agent workflows.