Most engineers know arrays, hash maps, and trees. But behind the scenes, systems like Redis, Cassandra, BigQuery, and Google Chrome rely on a different breed of data structures - ones that trade perfect accuracy for massive memory savings.
These are probabilistic data structures. They give you approximate answers using a fraction of the memory that exact approaches would need.
In this post, I'll walk through the three most useful ones:
| Data Structure | Question It Answers | Used By |
|---|---|---|
| Bloom Filter | "Is this item in the set?" | Cassandra, Chrome, Akamai |
| Count-Min Sketch | "How many times has this appeared?" | Redis, PostgreSQL, network monitors |
| HyperLogLog | "How many unique items are there?" | Redis, BigQuery, Spark |
Let's break each one down.
1. Bloom Filter - Membership Testing
The problem: You have 1 billion usernames. You want to check if a username is taken. Storing them all in a hash set needs 20+ GB of RAM.
The solution: A Bloom filter does it in ~1.2 GB with 1% false positive rate.
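Where do those numbers come from? The standard Bloom filter sizing formulas relate the item count n, the target false-positive rate p, the bit-array size m, and the hash count k. Here's a quick back-of-the-envelope check (variable names are just for illustration):

```python
import math

n = 1_000_000_000   # items to store
p = 0.01            # acceptable false-positive rate

m = -n * math.log(p) / (math.log(2) ** 2)   # optimal number of bits
k = (m / n) * math.log(2)                   # optimal number of hash functions

print(f"{m / 8 / 1e9:.2f} GB of bits, {round(k)} hash functions")
# -> roughly 1.20 GB and 7 hash functions
```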
How it works
A Bloom filter uses a bit array and multiple hash functions:
- To add an item: hash it with each function, set those bit positions to 1
- To check an item: hash it, check if ALL positions are 1
- If any bit is 0 → definitely not in the set
- If all bits are 1 → probably in the set (false positive possible)
Add "hello" (hashes to positions 2, 5, 9):
Index: 0 1 2 3 4 5 6 7 8 9 10 11
Bits: 0 0 1 0 0 1 0 0 0 1 0 0
The key guarantee: no false negatives. If the filter says "no," the item was never added. Period.
Why it matters
- Cassandra uses Bloom filters to skip disk reads for keys that don't exist - saving 90%+ of unnecessary I/O
- Chrome checks URLs against a Bloom filter of known malicious sites before hitting Google's servers
- Akamai CDN avoids caching one-hit wonders by requiring items to appear in the Bloom filter before caching
Quick implementation (Python)
```python
class BloomFilter:
    def __init__(self, size, hash_count):
        self.size = size                # number of bits
        self.hash_count = hash_count    # hash functions per item
        self.bit_array = [0] * size

    def add(self, item):
        for i in range(self.hash_count):
            # Salt the built-in hash to simulate independent hash functions
            pos = hash(f"{item}:{i}") % self.size
            self.bit_array[pos] = 1

    def might_contain(self, item):
        return all(
            self.bit_array[hash(f"{item}:{i}") % self.size]
            for i in range(self.hash_count)
        )
```
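A quick usage sketch (the size and hash count here are arbitrary; in practice you'd derive them from the sizing formulas above):

```python
bf = BloomFilter(size=10_000_000, hash_count=7)
bf.add("alice")
bf.add("bob")

print(bf.might_contain("alice"))    # True
print(bf.might_contain("charlie"))  # False - or, rarely, a false positive
```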
Deep dive: I wrote a comprehensive guide covering the math, optimal parameter selection, Counting Bloom Filters, Cuckoo Filters, and production implementation details - How Bloom Filters Work
2. Count-Min Sketch - Frequency Estimation
The problem: You're processing a stream of 1 billion events. You want to know which events are most frequent. Storing a counter per unique event would need gigabytes.
The solution: Count-Min Sketch tracks frequencies in a few hundred KB of fixed memory.
How it works
Count-Min Sketch uses a 2D array (d rows × w columns). Each row has its own hash function.
- To add an item: hash with each function, increment the counter at each (row, column)
- To query an item: hash with each function, return the minimum of all counters
Add "apple" (hashes to columns 2, 5, 1):
Row 1: [0, 0, 1, 0, 0, 0, 0, 0]
Row 2: [0, 0, 0, 0, 0, 1, 0, 0]
Row 3: [0, 1, 0, 0, 0, 0, 0, 0]
Query "apple": min(1, 1, 1) = 1 β
Why the minimum? Collisions can only increase counters. The minimum across all rows is the least-inflated estimate. This means Count-Min Sketch never underestimates - it might say 1,005 when the true count is 1,000, but never 995.
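To make the add/query loop concrete, here's a minimal sketch in Python. The `CountMinSketch` class, its default `width`/`depth`, and the salted use of Python's built-in `hash()` are illustrative choices, not how Redis implements it:

```python
class CountMinSketch:
    def __init__(self, width=2000, depth=5):
        self.width = width      # columns per row
        self.depth = depth      # rows, one hash function each
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        for i in range(self.depth):
            col = hash(f"{item}:{i}") % self.width   # salt the hash per row
            self.table[i][col] += count

    def estimate(self, item):
        # The minimum across rows is the least collision-inflated counter
        return min(
            self.table[i][hash(f"{item}:{i}") % self.width]
            for i in range(self.depth)
        )
```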
Where it's used
- Network monitoring: Track bytes per IP to detect DDoS attacks
- Redis: Has built-in Count-Min Sketch support (`CMS.INCRBY`, `CMS.QUERY`)
- PostgreSQL: Uses sketch-based statistics internally for query planning
- Finding heavy hitters: Identify the top-K most frequent items in a stream
The killer feature: mergeability
Count-Min Sketches merge by adding corresponding counters. Each server tracks local event frequencies, then you merge periodically for global counts. Perfect for distributed architectures.
```python
# Merge two sketches: both must share the same width, depth, and hash functions
for i in range(depth):
    for j in range(width):
        merged.table[i][j] = cms1.table[i][j] + cms2.table[i][j]
```
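With the hypothetical `CountMinSketch` class sketched earlier, that merge plays out like this (the event name and counts are made up):

```python
# Two servers track the same event stream locally
server_a, server_b = CountMinSketch(), CountMinSketch()
for _ in range(700):
    server_a.add("checkout_click")
for _ in range(300):
    server_b.add("checkout_click")

# Merge by adding corresponding counters
merged = CountMinSketch()
for i in range(merged.depth):
    for j in range(merged.width):
        merged.table[i][j] = server_a.table[i][j] + server_b.table[i][j]

print(merged.estimate("checkout_click"))   # 1000 (or slightly more, never less)
```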
Deep dive: Full walkthrough with heavy hitter detection algorithms, parameter tuning, and production Redis examples - How Count-Min Sketch Works
3. HyperLogLog - Cardinality Estimation
The problem: You need to show "unique visitors today" for a site with 100 million page views. Storing every visitor ID to count uniques would cost gigabytes.
The solution: HyperLogLog gives you the unique count using only 12 KB of memory - with ~1% error.
The core insight
When you hash items uniformly, about half start with 0, a quarter start with 00, an eighth start with 000, and so on.
If you've seen a hash with 20 leading zeros, you've probably seen around 2²⁰ ≈ 1 million unique items. That pattern is rare enough that it only appears after processing many items.
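You can see the pattern with a tiny simulation, using random 64-bit integers as stand-ins for hash values:

```python
import random

random.seed(7)
for n in (1_000, 1_000_000):
    max_zeros = 0
    for _ in range(n):
        h = random.getrandbits(64)                        # pretend this is an item's hash
        max_zeros = max(max_zeros, 64 - h.bit_length())   # leading zeros in 64 bits
    print(f"{n:>9,} items -> longest run of leading zeros: {max_zeros}")
# The longest run grows roughly like log2(n), so 2**max_zeros ≈ n
```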
How it works
- Hash each item to a 64-bit value
- Use first 14 bits to pick one of 16,384 registers (buckets)
- Count leading zeros in the remaining bits
- Store the maximum leading zero count per register
- Estimate cardinality using a harmonic mean across all registers
Add "user_123":
Hash β 64 bits
First 14 bits β Register 5,782
Remaining bits β 7 leading zeros
registers[5782] = max(old_value, 7+1) = 8
Duplicates don't change anything - same item always hashes the same way. That's how it counts unique items without storing them.
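Here's a minimal sketch of those steps in Python. It skips the bias and small/large-range corrections real implementations apply, and uses SHA-1 purely as a convenient 64-bit hash - treat it as an illustration, not a drop-in HyperLogLog:

```python
import hashlib

class HyperLogLog:
    def __init__(self, p=14):
        self.p = p                      # 2**14 = 16,384 registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias constant for large m
        z = sum(2.0 ** -r for r in self.registers)     # harmonic-mean denominator
        return alpha * self.m * self.m / z
```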
Memory vs. accuracy
| Registers | Memory | Standard Error |
|---|---|---|
| 1,024 | 0.75 KB | 3.25% |
| 16,384 | 12 KB | 0.81% |
| 65,536 | 48 KB | 0.41% |
Redis defaults to 16,384 registers. If the true count is 1,000,000, it'll report between ~991,900 and ~1,008,100.
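The error column is just the HyperLogLog standard-error formula, roughly 1.04/√m for m registers:

```python
import math

# Standard error of a HyperLogLog estimate with m registers
for m in (1_024, 16_384, 65_536):
    print(f"{m:>6} registers -> ±{1.04 / math.sqrt(m):.2%} standard error")
# 1,024 -> ±3.25%, 16,384 -> ±0.81%, 65,536 -> ±0.41%
```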
Where it's used
- Redis: `PFADD`/`PFCOUNT` - 12 KB per key, any number of elements
- Google BigQuery: `APPROX_COUNT_DISTINCT()` - orders of magnitude faster than exact counting
- Apache Spark: `approx_count_distinct()` for distributed datasets
- Presto/Trino, Druid, Elasticsearch: All use HyperLogLog internally
Merging for distributed counting
HyperLogLog sketches merge by taking the max per register:
```python
# Merge two HyperLogLogs: keep the larger value in each register
for i in range(num_registers):
    merged.registers[i] = max(hll1.registers[i], hll2.registers[i])
```
Each shard counts locally, and you merge at the end. No raw data transfer needed.
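Using the hypothetical `HyperLogLog` class sketched above, a two-shard merge looks like this (the user IDs and overlap are invented for the example):

```python
shard_a, shard_b = HyperLogLog(), HyperLogLog()
for i in range(60_000):
    shard_a.add(f"user_{i}")            # users 0..59,999
for i in range(40_000, 100_000):
    shard_b.add(f"user_{i}")            # users 40,000..99,999 (overlaps shard_a)

merged = HyperLogLog()
for i in range(merged.m):
    merged.registers[i] = max(shard_a.registers[i], shard_b.registers[i])

print(round(merged.count()))            # ≈ 100,000 unique users, within ~1%
```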
Deep dive: Full algorithm walkthrough with bias correction, small/large range corrections, and implementation - How HyperLogLog Works
Comparison: Which One Do You Need?
| | Bloom Filter | Count-Min Sketch | HyperLogLog |
|---|---|---|---|
| Answers | "Is X in the set?" | "How often did X appear?" | "How many unique items?" |
| Error type | False positives only | Overestimates only | Β± percentage error |
| Memory | ~1.2 MB per 1M items | ~100 KB (fixed) | ~12 KB (fixed) |
| Mergeable? | OR bit arrays | Add counters | Max per register |
| Deletable? | No (standard) | No (standard) | N/A |
| Used by | Cassandra, Chrome | Redis, PostgreSQL | Redis, BigQuery, Spark |
Decision flowchart
Ask yourself what question you need answered:
- "Is this item in the set?" β Bloom Filter
- "How many times has this item appeared?" β Count-Min Sketch
- "How many unique items are there?" β HyperLogLog
- "I need exact answers" β Use a hash map/set (and accept the memory cost)
The Common Thread
All three structures share the same philosophy:
- Trade accuracy for memory - accept small, bounded errors to achieve orders-of-magnitude memory savings
- Use hash functions - map inputs uniformly to extract statistical properties
- Support merging - combine results from distributed nodes without raw data transfer
- Never store the actual items - they answer questions about data without remembering the data itself
This is why they appear in virtually every large-scale system. When you're processing billions of items, exact answers often aren't worth the memory cost.
What to Read Next
If you found this useful, I've written detailed guides on each structure with full implementations, mathematical proofs, and production patterns:
- How Bloom Filters Work - includes Counting Bloom Filters, Cuckoo Filters, and optimal parameter selection
- How Count-Min Sketch Works - heavy hitter detection, Redis integration, and distributed merging
- How HyperLogLog Works - the full algorithm with bias correction and Redis PFADD/PFCOUNT examples
For more on data structures and system design, check out singhajit.com - I publish weekly deep dives on the engineering behind real-world systems.
What probabilistic data structure have you used in production? Drop a comment below - I'd love to hear about your use case.