Ravi Patel

Posted on • Originally published at ssimplifi.com

Exact vs semantic caching for LLMs: when each wins, measured

If you're building on top of an LLM API and the bill is starting to bite, you've probably read that caching is the answer. The follow-up question is which kind of caching, and the honest answer is: usually both, but for different reasons. Exact-match caching costs you almost nothing to run and never returns a wrong answer; the catch is that it hits maybe one in ten requests in production. Semantic caching catches several times that volume but introduces a correctness risk you have to engineer for. This post walks through where each one wins, the math behind the tradeoff, and how to decide what to run for your workload.

This post is part of a broader guide to AI API caching. Exact and semantic are two of its three layers; the third is provider-native cache passthrough, covered separately.

Definitions, briefly

Exact-match caching computes a deterministic fingerprint of the request (typically SHA-256 over the normalized messages array, model name, temperature, and other request parameters), then looks up that fingerprint in a key-value store like Redis. If the fingerprint exists, return the cached response. Lookup is O(1) and sub-10ms p95. The store is bounded by your cache size budget; entries evict by LRU or TTL.
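
As a rough illustration of the fingerprinting step, here is a minimal Python sketch. The function name `fingerprint`, the `ExactCache` class, and the specific normalization (trimming whitespace, sorting keys) are assumptions made for this example, not a description of any particular product; an in-memory dict stands in for Redis.

```python
import hashlib
import json


def fingerprint(messages, model, temperature=0.0, **params):
    """Return a SHA-256 hex digest for a chat request (illustrative normalization)."""
    normalized = {
        "model": model,
        # Coerce to float so 0 and 0.0 serialize identically.
        "temperature": float(temperature),
        # Trim surrounding whitespace so trivially different strings match.
        "messages": [
            {"role": m["role"], "content": m["content"].strip()} for m in messages
        ],
        "params": params,
    }
    # sort_keys and fixed separators make the serialization deterministic.
    payload = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExactCache:
    """In-memory stand-in for a key-value store such as Redis (hypothetical)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def put(self, key, response):
        self._store[key] = response
```

Any request parameter that can change the model's output (here the model name and temperature) belongs in the fingerprint; leaving one out would let two different requests share a cache entry.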

Semantic caching embeds the user's prompt with an embedding model (often a small fast one like BGE-small, MiniLM, or text-embedding-3-small), then queries a vector database for the nearest stored embedding. If the cosine similarity between the incoming embedding and the nearest stored one exceeds a threshold (usually 0.93–0.97), serve the cached response associated with that stored embedding. Lookup is roughly O(log n) in the number of stored entries when backed by an approximate nearest-neighbor index such as HNSW, and runs around 20–40ms p95 including the embedding inference.
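
To make the threshold check concrete, here is a small Python sketch. It assumes the caller has already computed embeddings with some model, and it uses a brute-force linear scan in place of a vector database, so it shows the matching logic only, not the indexed lookup. `SemanticCache` and its method names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Brute-force stand-in for a vector database (illustrative only)."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # list of (embedding, response) pairs

    def put(self, embedding, response):
        self._entries.append((list(embedding), response))

    def get(self, embedding):
        """Return (response, score) on a hit, or (None, best_score) on a miss."""
        best_score, best_response = -1.0, None
        for stored, response in self._entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        # Serve the nearest entry only if it clears the threshold.
        if best_score >= self.threshold:
            return best_response, best_score
        return None, best_score
```

Returning the similarity score alongside the response, even on a miss, makes it easy to log scores and study how the hit rate would change at other thresholds.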

Both layers cache the full response. Provider-native passthrough is different — it caches the prefix processing on the provider's side — and is covered in Anthropic prompt caching, explained. The rest of this post stays on the response-caching layers.

The hit-rate gap is real and structural

Exact-match cache miss-rates are high in real LLM traffic for a reason. Production prompts almost always carry per-request context — a user name, a session ID, a current timestamp, a recently-retrieved RAG passage, a varying tool list. Even if the underlying user intent is identical across two requests, the prompt strings are byte-different, and the SHA-256 fingerprint diverges. The result is that exact caches hit on the 5–15% of traffic that's truly identical — things like cron-scheduled internal queries, deterministic system-only test calls, and duplicate-submit user actions.

Semantic caches catch the variations exact caches miss. Two users asking "what's your refund policy?" and "how do I get my money back?" send byte-different prompts, embed to nearly-parallel vectors, and the cosine similarity between them lands around 0.96–0.98. A semantic cache at threshold 0.95 returns the same answer to both. Production semantic-cache hit rates are typically 25–50% on top of whatever the exact cache caught, depending heavily on workload shape: support chatbots and FAQ systems see the high end; tool-calling agents with variable retrieval contexts see the low end.

The structural reason for the gap is that user intent has lower-dimensional structure than user input. There are thousands of ways to ask "what's your refund policy" and only one refund policy. Embeddings collapse the input dimensionality down to the intent, which is what makes semantic caching work at all.

When exact wins

Exact-match is the right choice — and often the only right choice — when any of these hold:

  • Your traffic is deterministic. Cron jobs, ETL pipelines, evaluation runs, regression tests. The same prompt fires the same way every time. Exact-match hit rates can exceed 90% here, and you pay zero embedding overhead.
  • Correctness is non-negotiable. Legal, medical, financial workloads where serving a wrong-but-similar answer is a real liability. An exact cache cannot answer a different question by mistake: it returns a cached response if and only if the normalized request is byte-identical to the one that produced it.
  • Your prompts are short and the cache is small. If you're caching 50K entries that are 1KB each, exact cache fits in 50MB of Redis and lookup is trivial. Semantic caching's embedding-vector storage (1.5KB per BGE-small entry plus vector-index overhead) dominates at this scale.
  • You can't tolerate the embedding latency tail. Exact lookup is sub-10ms p95; semantic adds 20–40ms p95 for the embedding inference plus the vector lookup. On a chat UX where users feel anything above 200ms, every millisecond counts.

When semantic wins

Semantic-match earns its complexity when:

  • Your users phrase the same question 10 different ways. Customer-support chatbots, in-product help, FAQ surfaces. Exact-match cache hit rates in these workloads sit in the low single digits; semantic at 0.95 can climb to 40%+.
  • You're serving a knowledge-grounded LLM where the underlying answers don't change often. Documentation Q&A, policy lookups, "how do I do X" tutorials. The cache stays valid for hours or days because the source-of-truth content updates slowly.
  • The unit-economics math justifies the embedding overhead. A semantic hit on a $0.015 call (typical Sonnet-class input + output) avoids a $0.015 charge. The embedding inference cost on BGE-small is around $0.00002 per call. The break-even hit rate is less than 0.2% — you almost can't lose money running semantic caching as long as your false-positive rate is acceptable.
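
The break-even claim in the last bullet can be checked with a few lines of Python. This is a simplified model that assumes every request pays the embedding cost and every hit avoids one full provider call; the function names are made up for this example, and the dollar figures are the illustrative ones quoted above.

```python
def break_even_hit_rate(embedding_cost, call_cost):
    """Hit rate at which savings from hits equal the embedding cost paid per lookup."""
    return embedding_cost / call_cost


def net_saving_per_request(hit_rate, call_cost, embedding_cost):
    """Expected saving per request: avoided provider spend minus embedding spend."""
    return hit_rate * call_cost - embedding_cost


CALL_COST = 0.015        # illustrative Sonnet-class call, from the text
EMBEDDING_COST = 0.00002  # illustrative BGE-small inference, from the text

rate = break_even_hit_rate(EMBEDDING_COST, CALL_COST)
print(f"break-even hit rate: {rate:.3%}")  # well under the 0.2% quoted above
```

This model deliberately ignores the cost of a false positive, which is not measured in provider dollars and has to be controlled separately.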

The false-positive question is where most semantic-caching implementations fail. A cache that returns the wrong answer for the customer's question is worse than no cache at all — the customer leaves with bad information, blames the product, and you may not even know it happened. The discipline that makes this safe is threshold engineering, covered next.

The threshold math

The cosine similarity threshold is the single tunable lever on a semantic cache. Set it too low and you serve confidently-wrong answers; set it too high and you don't catch enough hits to be worth the embedding overhead. The defensible default is 0.95, and here's why.

Think of it as a precision/recall problem on the question "is this a true match?" Threshold tunes the boundary:

  • Threshold 0.99: near-zero false-positive rate but you only catch byte-identical-after-normalization requests. Effectively the same as exact-match, minus the simplicity. Not useful.
  • Threshold 0.95 (default): false positives in the low single digits on most real-world workloads. Recall is good — most "user asked the same thing in different words" cases land at 0.96+ similarity. Worth running.
  • Threshold 0.90: false positives jump to 8–15% on broad chat workloads. The kinds of misfires here are semantically related but distinct questions — "what's your refund policy" and "what's your shipping policy" both embed near each other and a 0.90 threshold collapses them. Almost never the right call.
  • Threshold 0.85: false positives are catastrophic — the cache becomes effectively a content-aware random-response generator. Stay away unless you have a downstream LLM judge re-validating every hit.

The shape of this curve is workload-dependent. A narrow workload (e.g. a chatbot for a single product's documentation) can run threshold 0.92 safely because all the relevant questions cluster tightly. A broad workload (e.g. a general-purpose assistant) needs to run 0.96+ because the question space is more spread out.

The right approach is to instrument it. Run the cache at 0.95, log every hit's similarity score, periodically sample 100 hits and have a human judge whether the cached answer was appropriate. If false positives are <2%, you can experiment with lowering the threshold to recover more hits. If false positives are >5%, raise it.
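
One way to sketch that review loop in Python is shown below. The helper names, the shape of the hit log, and the "hold" band between 2% and 5% are assumptions for illustration; the human judgment step itself is outside the code and is represented as a list of booleans.

```python
import random


def sample_hits(hit_log, sample_size=100, seed=None):
    """Pick up to sample_size logged cache hits uniformly at random for review."""
    if len(hit_log) <= sample_size:
        return list(hit_log)
    return random.Random(seed).sample(hit_log, sample_size)


def false_positive_rate(judgments):
    """judgments: booleans, True when a reviewer found the cached answer appropriate."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return judgments.count(False) / len(judgments)


def threshold_advice(fp_rate, lower_bound=0.02, upper_bound=0.05):
    """Map a sampled false-positive rate to the rule of thumb from the text."""
    if fp_rate > upper_bound:
        return "raise"
    if fp_rate < lower_bound:
        return "try_lowering"
    return "hold"
```

With only 100 samples the estimate is noisy, so it is reasonable to repeat the sampling over several periods before moving the threshold.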

A worked example

Suppose you operate a support chatbot built on Claude Sonnet. Traffic profile:

  • 20,000 chat completions per day
  • Average prompt length: 800 input tokens (system prompt + retrieved context + user message)
  • Average response: 300 output tokens
  • Claude Sonnet pricing (illustrative): $3 per million input tokens, $15 per million output tokens

Provider cost without caching: 20,000 × (800 × $3 + 300 × $15) / 1,000,000 = $138 / day (~$4,100 / month).

Now layer in caching:

  • Exact cache catches 8% of traffic. Saved: 8% × $138 = $11/day.
  • Semantic cache catches 38% of the remaining traffic at threshold 0.95. Saved: 38% × 92% × $138 = $48/day.
  • Total avoided spend: $59/day, or about 43% of the bill.

The semantic cache's embedding cost: at most 20,000 × $0.00002 = $0.40/day, since only requests that miss the exact cache need an embedding. Negligible.

The infrastructure cost: Redis cache (~$10/month managed) + Upstash Vector (~$30/month for 500K vectors). Total ~$40/month against a savings of ~$1,800/month. Pay-back is one day of traffic.
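
The arithmetic in this example is easy to reproduce and adapt to your own traffic. The Python sketch below uses the illustrative numbers from this section; the function and variable names are invented for the example, and the 30-day month is an assumption.

```python
def daily_provider_cost(requests, input_tokens, output_tokens,
                        input_price_per_m, output_price_per_m):
    """Daily spend given per-request token counts and per-million-token prices."""
    per_request = (input_tokens * input_price_per_m
                   + output_tokens * output_price_per_m) / 1_000_000
    return requests * per_request


baseline = daily_provider_cost(20_000, 800, 300, 3.0, 15.0)  # about $138/day

exact_hit_rate = 0.08
semantic_hit_rate = 0.38  # applied to traffic that missed the exact cache

exact_saved = exact_hit_rate * baseline
semantic_saved = semantic_hit_rate * (1 - exact_hit_rate) * baseline
total_saved = exact_saved + semantic_saved
saved_share = total_saved / baseline

embedding_cost = 20_000 * 0.00002      # upper bound: embeds every request
monthly_saved = total_saved * 30       # assumes a 30-day month

print(f"baseline ${baseline:.2f}/day, saved ${total_saved:.2f}/day "
      f"({saved_share:.0%}), about ${monthly_saved:.0f}/month")
```

Swapping in your own request volume, token counts, and hit rates gives a quick first estimate before measuring real traffic.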

The point isn't the specific numbers — it's that the cost-of-running both layers is rounding-error against the savings on a workload where caching works at all. The only real question is the false-positive rate, which threshold engineering solves.

How Prism runs both

Prism runs all three caching layers (exact, semantic, and provider-native passthrough) by default on every paid request. The dispatcher looks up exact first (Redis, sub-8ms p95), falls through to semantic on miss (Upstash Vector with BGE-small embeddings at 0.95 cosine, ~30ms p95 including the embedding call), and otherwise proxies to the provider with cache-control markers attached for provider-native passthrough. Every response carries an X-Prism-Cache-Status header indicating which layer (if any) served the request, plus X-Prism-Cache-Saved-Cents showing the amount saved on that request, in cents.
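
The lookup order described above can be sketched generically in Python. This is not Prism's code: the `dispatch` function, its callback parameters, and the status strings are hypothetical, chosen only to show the fall-through from exact to semantic to provider.

```python
def dispatch(request_key, embed, exact_lookup, semantic_lookup, call_provider):
    """Try the exact cache, then the semantic cache, then the provider.

    All five arguments are caller-supplied; the returned status string says
    which layer produced the response.
    """
    cached = exact_lookup(request_key)
    if cached is not None:
        return cached, "hit-exact"

    # The embedding is computed only after an exact miss, so exact hits
    # never pay the embedding latency.
    embedding = embed()
    cached = semantic_lookup(embedding)
    if cached is not None:
        return cached, "hit-semantic"

    return call_provider(), "miss"
```

A fuller version would also write the provider's response back to both caches on a miss; that step is omitted here to keep the control flow visible.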

A couple of design choices worth calling out:

Fingerprint normalization. Prism normalizes message arrays before fingerprinting (it strips internal cache-control markers, sorts keys deterministically, and tokenizes consistently) so trivially-equivalent requests hash to the same key. The companion article, Prompt cache fingerprinting pitfalls, walks through the edge cases that bit us during v1.1 development.

Threshold is per-scope configurable on Pro+. Default is 0.95, but Pro+ accounts can tune it per project via the X-Prism-Cache-Threshold header. The cache inspector at /dashboard/cache shows hit-rate-at-threshold curves so you can see what raising or lowering would do.

Streaming compatibility. Cache hits return non-streaming JSON regardless of the request's stream=true flag. Mid-stream caching is a footgun (a dropped stream would poison the cache); we sidestep it entirely.

You can model your own workload's caching ROI in the savings calculator before signing up — same pricing inputs we use internally.

Decision checklist

If you're picking what to run for your workload:

  1. Always run exact-match. Cost is trivial, hits are pure wins, correctness is guaranteed. There's no scenario where running it is worse than not running it.
  2. Run semantic if your workload has paraphrasable intent. Customer support, in-product help, FAQ, documentation Q&A — yes. Pure tool-calling agents with high-cardinality context — probably not.
  3. Pick threshold 0.95 to start. Instrument false-positive rate. Tune. Default is conservative on purpose. Sampling-based validation tells you what you can safely lower to.
  4. Layer on provider-native passthrough for any workload with a stable system prompt over a few hundred tokens. Anthropic's 90% off cache-read tokens and OpenAI's 50% off cached input are independent of the layers above and stack cleanly.

The economics on response caching for LLM APIs are unusually favorable — false-positive risk is the only real cost, and that's an engineering discipline problem, not an unsolvable one.


FAQ

What's the cosine similarity threshold I should start with?

0.95. It's conservative enough to keep false positives in the low single digits on most production workloads while still catching most real paraphrases. Tune from there based on sampled false-positive rate, not by intuition.

Doesn't semantic caching break for code prompts?

Often yes, depending on the embedding model. Code with the same intent but different variable names embeds far apart in most general-purpose embedding spaces, so semantic hit rates on code workloads are typically low. Two options: use a code-specialized embedding model (e.g. BGE-code), or accept that semantic caching on code prompts isn't where the wins live and rely on exact + provider-native.

Can I run semantic caching without an embedding model?

No. Semantic caching is defined as embedding-based similarity matching. What you can do is run exact + provider-native passthrough only, which catches a real chunk of traffic with no embedding dependency.

What happens when the underlying answer changes — is the cache poisoned?

This is the cache-invalidation problem and it's real. Two mitigations: TTL (entries expire after some configurable interval) and explicit invalidation (purge entries matching a pattern when source-of-truth content changes). Prism supports both — TTL is configurable per project on Pro+, and the cache inspector at /dashboard/cache supports per-pattern eviction.
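
For readers building their own cache, the two mitigations can be sketched in a few lines of Python. The `TTLCache` class below is a hypothetical in-memory illustration, not Prism's implementation; it assumes string keys and uses shell-style glob patterns for eviction, which is only one of several reasonable choices.

```python
import fnmatch
import time


class TTLCache:
    """In-memory cache with per-entry expiry and glob-pattern eviction."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}     # key -> (expires_at, value)

    def put(self, key, value):
        self._store[key] = (self._clock() + self.ttl, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry outlived its TTL: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def invalidate(self, pattern):
        """Evict every key matching a glob pattern; return how many were removed."""
        doomed = [k for k in self._store if fnmatch.fnmatchcase(k, pattern)]
        for k in doomed:
            del self._store[k]
        return len(doomed)
```

Pattern eviction works best when keys carry a readable prefix (for example a content area such as "refunds:") in front of the fingerprint, so a content change maps cleanly to a set of keys.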

Do I need a vector database for semantic caching?

Practically, yes. You need similarity search over thousands or millions of stored embeddings, which requires an index (HNSW or similar). Self-hosted options include pgvector and Qdrant; managed options include Pinecone and Upstash Vector. Prism uses Upstash Vector internally.


Want to see how three-layer caching applies to your workload? Read the parent guide on AI API caching for the full framework, or model your savings with the savings calculator. The semantic cache glossary entry covers the term in shorter form.