Most engineers know arrays, hash maps, and trees. But behind the scenes, systems like Redis, Cassandra, BigQuery, and Google Chrome rely on a different breed of data structures - ones that trade perfect accuracy for massive memory savings.
These are probabilistic data structures. They give you approximate answers using a fraction of the memory that exact approaches would need.
In this post, I'll walk through the three most useful ones:
| Data Structure | Question It Answers | Used By |
|---|---|---|
| Bloom Filter | "Is this item in the set?" | Cassandra, Chrome, Akamai |
| Count-Min Sketch | "How many times has this appeared?" | Redis, PostgreSQL, network monitors |
| HyperLogLog | "How many unique items are there?" | Redis, BigQuery, Spark |
Let's break each one down.
1. Bloom Filter - Membership Testing
The problem: You have 1 billion usernames. You want to check if a username is taken. Storing them all in a hash set needs 20+ GB of RAM.
The solution: A Bloom filter does it in ~1.2 GB with 1% false positive rate.
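Where do those numbers come from? The standard Bloom filter sizing formulas relate the item count n, the target false-positive rate p, the bit-array size m, and the hash count k. Here's a quick back-of-the-envelope check (variable names are just for illustration):

```python
import math

n = 1_000_000_000   # items to store
p = 0.01            # acceptable false-positive rate

m = -n * math.log(p) / (math.log(2) ** 2)   # optimal number of bits
k = (m / n) * math.log(2)                   # optimal number of hash functions

print(f"{m / 8 / 1e9:.2f} GB of bits, {round(k)} hash functions")
# -> roughly 1.20 GB and 7 hash functions
```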
How it works
A Bloom filter uses a bit array and multiple hash functions:
- To add an item: hash it with each function, set those bit positions to 1
- To check an item: hash it, check if ALL positions are 1
- If any bit is 0 → definitely not in the set
- If all bits are 1 → probably in the set (false positive possible)
Add "hello" (hashes to positions 2, 5, 9):
Index: 0 1 2 3 4 5 6 7 8 9 10 11
Bits: 0 0 1 0 0 1 0 0 0 1 0 0
The key guarantee: no false negatives. If the filter says "no," the item was never added. Period.
Why it matters
- Cassandra uses Bloom filters to skip disk reads for keys that don't exist - saving 90%+ of unnecessary I/O
- Chrome checks URLs against a Bloom filter of known malicious sites before hitting Google's servers
- Akamai CDN avoids caching one-hit wonders by requiring items to appear in the Bloom filter before caching
Quick implementation (Python)
```python
class BloomFilter:
    def __init__(self, size, hash_count):
        self.size = size                # number of bits
        self.hash_count = hash_count    # hash functions per item
        self.bit_array = [0] * size

    def add(self, item):
        for i in range(self.hash_count):
            # Salt the built-in hash to simulate independent hash functions
            pos = hash(f"{item}:{i}") % self.size
            self.bit_array[pos] = 1

    def might_contain(self, item):
        return all(
            self.bit_array[hash(f"{item}:{i}") % self.size]
            for i in range(self.hash_count)
        )
```
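A quick usage sketch (the size and hash count here are arbitrary; in practice you'd derive them from the sizing formulas above):

```python
bf = BloomFilter(size=10_000_000, hash_count=7)
bf.add("alice")
bf.add("bob")

print(bf.might_contain("alice"))    # True
print(bf.might_contain("charlie"))  # False - or, rarely, a false positive
```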
Deep dive: I wrote a comprehensive guide covering the math, optimal parameter selection, Counting Bloom Filters, Cuckoo Filters, and production implementation details - How Bloom Filters Work
2. Count-Min Sketch - Frequency Estimation
The problem: You're processing a stream of 1 billion events. You want to know which events are most frequent. Storing a counter per unique event would need gigabytes.
The solution: Count-Min Sketch tracks frequencies in a few hundred KB of fixed memory.
How it works
Count-Min Sketch uses a 2D array (d rows × w columns). Each row has its own hash function.
- To add an item: hash with each function, increment the counter at each (row, column)
- To query an item: hash with each function, return the minimum of all counters
Add "apple" (hashes to columns 2, 5, 1):
Row 1: [0, 0, 1, 0, 0, 0, 0, 0]
Row 2: [0, 0, 0, 0, 0, 1, 0, 0]
Row 3: [0, 1, 0, 0, 0, 0, 0, 0]
Query "apple": min(1, 1, 1) = 1 β
Why the minimum? Collisions can only increase counters. The minimum across all rows is the least-inflated estimate. This means Count-Min Sketch never underestimates - it might say 1,005 when the true count is 1,000, but never 995.
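To make the add/query loop concrete, here's a minimal sketch in Python. The `CountMinSketch` class, its default `width`/`depth`, and the salted use of Python's built-in `hash()` are illustrative choices, not how Redis implements it:

```python
class CountMinSketch:
    def __init__(self, width=2000, depth=5):
        self.width = width      # columns per row
        self.depth = depth      # rows, one hash function each
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        for i in range(self.depth):
            col = hash(f"{item}:{i}") % self.width   # salt the hash per row
            self.table[i][col] += count

    def estimate(self, item):
        # The minimum across rows is the least collision-inflated counter
        return min(
            self.table[i][hash(f"{item}:{i}") % self.width]
            for i in range(self.depth)
        )
```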
Where it's used
- Network monitoring: Track bytes per IP to detect DDoS attacks
- Redis: Has built-in Count-Min Sketch support (`CMS.INCRBY`, `CMS.QUERY`)
- PostgreSQL: Uses sketch-based statistics internally for query planning
- Finding heavy hitters: Identify the top-K most frequent items in a stream
The killer feature: mergeability
Count-Min Sketches merge by adding corresponding counters. Each server tracks local event frequencies, then you merge periodically for global counts. Perfect for distributed architectures.
```python
# Merge two sketches: both must share the same width, depth, and hash functions
for i in range(depth):
    for j in range(width):
        merged.table[i][j] = cms1.table[i][j] + cms2.table[i][j]
```
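With the hypothetical `CountMinSketch` class sketched earlier, that merge plays out like this (the event name and counts are made up):

```python
# Two servers track the same event stream locally
server_a, server_b = CountMinSketch(), CountMinSketch()
for _ in range(700):
    server_a.add("checkout_click")
for _ in range(300):
    server_b.add("checkout_click")

# Merge by adding corresponding counters
merged = CountMinSketch()
for i in range(merged.depth):
    for j in range(merged.width):
        merged.table[i][j] = server_a.table[i][j] + server_b.table[i][j]

print(merged.estimate("checkout_click"))   # 1000 (or slightly more, never less)
```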
Deep dive: Full walkthrough with heavy hitter detection algorithms, parameter tuning, and production Redis examples - How Count-Min Sketch Works
3. HyperLogLog - Cardinality Estimation
The problem: You need to show "unique visitors today" for a site with 100 million page views. Storing every visitor ID to count uniques would cost gigabytes.
The solution: HyperLogLog gives you the unique count using only 12 KB of memory - with ~1% error.
The core insight
When you hash items uniformly, about half start with 0, a quarter start with 00, an eighth start with 000, and so on.
If you've seen a hash with 20 leading zeros, you've probably seen around 2²⁰ ≈ 1 million unique items. That pattern is rare enough that it only appears after processing many items.
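You can see the pattern with a tiny simulation, using random 64-bit integers as stand-ins for hash values:

```python
import random

random.seed(7)
for n in (1_000, 1_000_000):
    max_zeros = 0
    for _ in range(n):
        h = random.getrandbits(64)                        # pretend this is an item's hash
        max_zeros = max(max_zeros, 64 - h.bit_length())   # leading zeros in 64 bits
    print(f"{n:>9,} items -> longest run of leading zeros: {max_zeros}")
# The longest run grows roughly like log2(n), so 2**max_zeros ≈ n
```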
How it works
- Hash each item to a 64-bit value
- Use first 14 bits to pick one of 16,384 registers (buckets)
- Count leading zeros in the remaining bits
- Store the maximum leading zero count per register
- Estimate cardinality using a harmonic mean across all registers
Add "user_123":
Hash β 64 bits
First 14 bits β Register 5,782
Remaining bits β 7 leading zeros
registers[5782] = max(old_value, 7+1) = 8
Duplicates don't change anything - same item always hashes the same way. That's how it counts unique items without storing them.
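Here's a minimal sketch of those steps in Python. It skips the bias and small/large-range corrections real implementations apply, and uses SHA-1 purely as a convenient 64-bit hash - treat it as an illustration, not a drop-in HyperLogLog:

```python
import hashlib

class HyperLogLog:
    def __init__(self, p=14):
        self.p = p                      # 2**14 = 16,384 registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias constant for large m
        z = sum(2.0 ** -r for r in self.registers)     # harmonic-mean denominator
        return alpha * self.m * self.m / z
```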
Memory vs. accuracy
| Registers | Memory | Standard Error |
|---|---|---|
| 1,024 | 0.75 KB | 3.25% |
| 16,384 | 12 KB | 0.81% |
| 65,536 | 48 KB | 0.41% |
Redis defaults to 16,384 registers. If the true count is 1,000,000, it'll report between ~991,900 and ~1,008,100.
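The error column is just the HyperLogLog standard-error formula, roughly 1.04/√m for m registers:

```python
import math

# Standard error of a HyperLogLog estimate with m registers
for m in (1_024, 16_384, 65_536):
    print(f"{m:>6} registers -> ±{1.04 / math.sqrt(m):.2%} standard error")
# 1,024 -> ±3.25%, 16,384 -> ±0.81%, 65,536 -> ±0.41%
```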
Where it's used
- Redis: `PFADD`/`PFCOUNT` - 12 KB per key, any number of elements
- Google BigQuery: `APPROX_COUNT_DISTINCT()` - orders of magnitude faster than exact counting
- Apache Spark: `approx_count_distinct()` for distributed datasets
- Presto/Trino, Druid, Elasticsearch: All use HyperLogLog internally
Merging for distributed counting
HyperLogLog sketches merge by taking the max per register:
```python
# Merge two HyperLogLogs: keep the larger value in each register
for i in range(num_registers):
    merged.registers[i] = max(hll1.registers[i], hll2.registers[i])
```
Each shard counts locally, and you merge at the end. No raw data transfer needed.
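Using the hypothetical `HyperLogLog` class sketched above, a two-shard merge looks like this (the user IDs and overlap are invented for the example):

```python
shard_a, shard_b = HyperLogLog(), HyperLogLog()
for i in range(60_000):
    shard_a.add(f"user_{i}")            # users 0..59,999
for i in range(40_000, 100_000):
    shard_b.add(f"user_{i}")            # users 40,000..99,999 (overlaps shard_a)

merged = HyperLogLog()
for i in range(merged.m):
    merged.registers[i] = max(shard_a.registers[i], shard_b.registers[i])

print(round(merged.count()))            # ≈ 100,000 unique users, within ~1%
```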
Deep dive: Full algorithm walkthrough with bias correction, small/large range corrections, and implementation - How HyperLogLog Works
Comparison: Which One Do You Need?
| | Bloom Filter | Count-Min Sketch | HyperLogLog |
|---|---|---|---|
| Answers | "Is X in the set?" | "How often did X appear?" | "How many unique items?" |
| Error type | False positives only | Overestimates only | Β± percentage error |
| Memory | ~1.2 MB per 1M items | ~100 KB (fixed) | ~12 KB (fixed) |
| Mergeable? | OR bit arrays | Add counters | Max per register |
| Deletable? | No (standard) | No (standard) | N/A |
| Used by | Cassandra, Chrome | Redis, PostgreSQL | Redis, BigQuery, Spark |
Decision flowchart
Ask yourself what question you need answered:
- "Is this item in the set?" β Bloom Filter
- "How many times has this item appeared?" β Count-Min Sketch
- "How many unique items are there?" β HyperLogLog
- "I need exact answers" β Use a hash map/set (and accept the memory cost)
The Common Thread
All three structures share the same philosophy:
- Trade accuracy for memory - accept small, bounded errors to achieve orders-of-magnitude memory savings
- Use hash functions - map inputs uniformly to extract statistical properties
- Support merging - combine results from distributed nodes without raw data transfer
- Never store the actual items - they answer questions about data without remembering the data itself
This is why they appear in virtually every large-scale system. When you're processing billions of items, exact answers often aren't worth the memory cost.
What to Read Next
If you found this useful, I've written detailed guides on each structure with full implementations, mathematical proofs, and production patterns:
- How Bloom Filters Work - includes Counting Bloom Filters, Cuckoo Filters, and optimal parameter selection
- How Count-Min Sketch Works - heavy hitter detection, Redis integration, and distributed merging
- How HyperLogLog Works - the full algorithm with bias correction and Redis PFADD/PFCOUNT examples
For more on data structures and system design, check out singhajit.com - I publish weekly deep dives on the engineering behind real-world systems.
What probabilistic data structure have you used in production? Drop a comment below - I'd love to hear about your use case.