DEV Community

Athreya aka Maneshwar

Posted on May 26

Negative Lookups in Bf-Tree: Caching Things That Don't Exist

#webdev #programming #database #architecture

Hello, I'm Maneshwar. I'm building git-lrc, a Micro AI code reviewer that runs on every commit. It is free and source-available on Github. Star git-lrc to help devs discover the project. Do give it a try and share your feedback for improving the project.

Most caching systems optimize for finding records quickly.

But what happens when a query repeatedly asks for a key that does not exist?

SELECT * WHERE id = 999999;

Traditional record caches struggle here. If a record is missing from cache, the system cannot tell whether:

the record was never cached, or
the record truly does not exist

So it falls back to disk and checks the leaf page again.

Repeated negative searches become repeated I/O.

Bf-Tree Caches Missing Records Too

Bf-Tree treats a failed lookup as useful information.

When a key is searched and confirmed absent, it inserts a phantom record into the mini-page:

Search key=42
      ↓
Not found on disk
      ↓
Store phantom record

Future lookups:

Search key=42
      ↓
Phantom record found
      ↓
Return "not found"

No leaf-page access needed.

The absence of data becomes cached state.

Four Record Types Inside a Mini-Page

By this point, mini-pages can contain multiple record types:

Type	Dirty	Exists
Insert	Yes	Yes
Cache	No	Yes
Tombstone	Yes	No
Phantom	No	No

A phantom record is essentially:

"We already checked. This key doesn't exist."

This is surprisingly useful for workloads with frequent failed lookups.

Recovery Still Looks Familiar

Despite mini-pages and phantom records, durability remains conventional.

Bf-Tree uses:

Write-ahead logging (WAL) before commits
Checkpointing/snapshots to persist state
Recovery via WAL replay

Crash recovery roughly becomes:

Load snapshot
      ↓
Rebuild in-memory pages
      ↓
Replay WAL
      ↓
Restore latest state

So while Bf-Tree changes how pages are cached and merged, persistence still resembles traditional database systems.

The unusual idea is not recovery.

The unusual idea is this:

Bf-Tree treats "record does not exist" as something worth caching.

AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.

⭐ Star it on GitHub:

HexmosTech / git-lrc

Free, Micro AI Code Reviews That Run on Commit

git-lrc

Free, Micro AI Code Reviews That Run on Commit

AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

See It In Action

See git-lrc catch serious security issues such as leaked credentials, expensive cloud operations, and sensitive material in log statements

git-lrc-intro-60s.mp4

Why

🤖 AI agents silently break things. Code removed. Logic changed. Edge cases gone. You won't notice until production.
🔍 Catch it before it ships. AI-powered inline comments show you exactly what changed and what looks wrong.
…

View on GitHub

Top comments (2)

mote • May 29

The negative lookup caching angle is clever — it solves the pathological case where you're repeatedly checking for absent keys.

A few edge cases worth considering:

1. Cache invalidation scope
If the B-Tree is write-heavy, negative entries become stale fast. You're essentially caching "this key doesn't exist at time T" — but by T+1, it might. How do you invalidate negative cache entries on insert?

2. Memory vs. disk tradeoff
Negative lookups are cheap for in-memory indices, but for on-disk trees like RocksDB, the benefit is smaller. The page cache already buffers recently-accessed pages. The marginal gain from caching absent keys is mostly visible in hot-path workloads with high cache hit rates.

3. Consistency models
If you're running with eventual consistency (replication), negative entries can mislead readers on stale replicas. "Key X doesn't exist" may be true on node A but false on node B.

The pattern makes sense for embedded storage engines with single-writer workloads. Where's this running?

mote • May 31

The "negative cache" framing is interesting, but I'd push back slightly on the performance assumptions. Caching things that don't exist has an asymmetry that caching things that do exist doesn't: the miss itself is cheap, but the hit-check cost compounds at scale.

The key question is your false positive rate. If 30% of lookups are for keys that legitimately don't exist, you're burning cache capacity on noise. If it's closer to 5%, the caching strategy pays off. The break-even point depends heavily on the workload distribution.

For Bloom filters as a pre-filter: the interesting design choice is what happens when the BF says "probably exists" but the Bf-Tree says "definitely doesn't". That's a different error category than a pure cache miss â you've now paid the cost of both lookups. Getting the BF false positive rate tuned to match your actual negative lookup rate is what makes this a real optimization rather than just complexity overhead.

What's the memory footprint comparison at different negative lookup rates in your benchmarks?

Some comments may only be visible to logged-in visitors. Sign in to view all comments.