Tyler Tan

Posted on Jun 11

From 178ms to 1ms: When Store-to-Load Forwarding Stalls Your For Loop

#computerscience #cpp #performance #programming

Take a look at this C++ snippet. Spend half a minute reading it, then take a guess — how long do you think it takes to run?

constexpr int N  = 1'000'000;
constexpr int NG = 262144;

auto* buf = new uint64_t[NG];
std::memset(buf, 0xfe, NG * 8);          // fill with 0xfe, meaning "empty slot"

for (int i = 0; i < N; ++i) {
  int g       = (i >> 7) & (NG - 1);     // first 128 iterations g=0, next 128 g=1, and so on
  uint64_t w  = buf[g];                  // read 64-bit word

  uint64_t fb = w & 0x8080808080808080ULL;
  uint64_t e  = ( (w ^ 0xfefefefefefefefeULL) - 0x0101010101010101ULL )
              & ~(w ^ 0xfefefefefefefefeULL) & 0x8080808080808080ULL;
  int pos = fb ? (std::countr_zero(fb) / 8) : 0;

  ((uint8_t*)buf)[g * 8 + pos] = (i & 0x7f);  // write 1 byte
}

The program looks a bit odd, but it has a real-world use case (revealed at the end). What it does is actually straightforward:

There's a uint64_t array (262144 elements), all initialized to 0xfe
The loop runs 1 million times. The traversal pattern is simple: the first 128 iterations read/write buf[0], the next 128 read/write buf[1], then 128 on buf[2]... advancing every 128 iterations. From a cache perspective, this is very friendly — repeatedly hitting the same cache line
Each iteration: read a 64-bit value, do some bitwise operations, then modify one byte and write it back

The loop itself advances i from 0 to N-1 sequentially, and g increments accordingly — one group at a time before switching to the next. The prefetcher has an easy time predicting this. From temporal locality, spatial locality, to hardware prefetching, everything looks almost flawless.

With that in mind, let's estimate: each iteration does one mov load, a few ALU bitwise ops, one mov store. The bitwise ops barely consume extra cycles under superscalar scheduling. Assuming all stores hit L1 cache, each iteration should take about 5–8 cycles. 1 million iterations × 6 cycles ÷ 2.5 GHz ≈ 2.4 ms.

Let's actually run it:

Time: 178 ms

178 milliseconds. That's an average of ~178 ns per iteration. On the M5 chip I used (running at 2.5 GHz), that works out to about 445 clock cycles per iteration.

Thinking...

If each iteration only does one load, a few bitwise ops, and one store, why on earth would it cost 445 cycles?

False sharing? No — this is a single-threaded program; the cache coherence protocol isn't involved at all. False sharing doesn't apply here.

Cache misses? No — 128 consecutive iterations all operate on the same slot, repeatedly hitting L1d. The cache hit rate couldn't be higher.

Branch prediction? No — the loop body has no unpredictable if statements. The iteration count is known at compile time. The branch predictor is under zero pressure.

TLB miss then? No — the entire array is only 2 MB, easily fitting in L2, with TLB coverage to spare.

Keep thinking...

After ruling out each candidate one by one, an inconspicuous detail suddenly stands out — every iteration reads a full uint64_t, but modifies only one byte, then writes that changed byte back.

Read → modify 1 byte → write back → next iteration reads the same 64-bit word.

Wait.

When the next iteration executes w = buf[g], where is the byte we just wrote? It's still in the Store Buffer.

Why is there a Store Buffer? Because store instructions are too slow. A store needs to write data into L1d, but the target cache line might not even be in L1d — it could be held by another core, requiring an RFO (Request For Ownership) handshake, costing hundreds of cycles round-trip. Without a Store Buffer, the pipeline would stall — the store can't complete, and everything behind it grinds to a halt. The Store Buffer exists precisely to decouple the pipeline from the slow process of writing back to L1d: the store drops the data in and calls it done, the pipeline marches on, and the buffer drains asynchronously in the background.

Think of the Store Buffer as a "staging area" for L1d — data written into it sits there temporarily, not yet committed. Meanwhile, it has a bonus ability called Store-to-Load Forwarding: if a subsequent load happens to hit a store that hasn't drained from the buffer yet, the CPU forwards the latest value directly to the load, bypassing L1d entirely.

Sounds perfect, right? We just wrote a byte, and the next read gets it forwarded directly — no waiting at all.

But the problem is exactly in that "forwarding" step.

Our store wrote only 1 byte. The next load wants to read 8 bytes (the full 64-bit word). The store buffer only has the new value for that one modified byte — the other 7 bytes are still in L1d. The CPU can't simply forward 1 byte from the store buffer to satisfy an 8-byte load — "your data is incomplete; where do I get the missing 7 bytes?"

What the CPU does here is called Partial Store-to-Load Forwarding. Instead of a simple forward, it:

Stalls, waiting for that byte store to drain into L1d
Reads back the full 64-bit value from L1d (1 byte newly written, 7 bytes original)
Hands the read-back value to the load instruction

On Intel/AMD, this path costs 5–8 extra cycles. On Apple Silicon, although the store buffer is deeper, partial forwarding is just as slow. Every iteration waits for this "1 byte write → 8 byte read" merge to complete — and over 1 million iterations, that adds up to 178ms.

Attempt 1: Full-Word Write

Since byte→word forwarding is slow, let's try a different approach — don't write 1 byte; instead, construct the full 64-bit value and write it back in one shot (full-word RMW):

// Before (byte store):
((uint8_t*)buf)[g * 8 + pos] = (i & 0x7f);

// After (full-word store):
uint64_t new_w = (w & ~(0xFFULL << (pos * 8)))
               | ((uint64_t)(i & 0x7f) << (pos * 8));
buf[g] = new_w;

Result:

Slow (byte store) : 178 ms
Full-word RMW     :  75 ms    ← 2.4× faster

Some improvement, but still not fast enough — because the next iteration still reads and writes the same array element. Even though word→word forwarding is faster than byte→word, the CPU is still doing read-modify-write on the same address over and over:

store → store buffer → load hits store buffer (forwarding) → modify the read value → another store enters the store buffer

This dependency chain serializes the iterations, rendering out-of-order execution almost useless. Under data dependency, the CPU can only wait.

The Real Fix: Breaking the Dependency Chain

The core problem is simple: each iteration reads what the previous iteration just wrote. If we make consecutive iterations access different array elements, the dependency chain disappears.

The fix — when computing the index g, don't preserve low-bit sequentiality after hashing. Add a bit mixer to the hash value, scattering consecutive inputs across the full space:

// Before (sequential index):
int g = (i >> 7) & (NG - 1);

// After (hash mixer scatters input):
auto mix = [](uint64_t h) {
    h ^= h >> 33; h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33; h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33; return h;
};
int g = mix(i) & (NG - 1);

The mixer is MurmurHash3's finalizer — three XORs, two multiplications, three shifts, all done within 5 ns. But because mix(i) and mix(i+1) produce completely uncorrelated values, consecutive iterations almost certainly access different array elements.

Now the result:

Slow (sequential) : 178 ms
Fast (hashed)     :   1 ms   ← 178× faster

1ms. The bottleneck drops from 445 cycles/iteration to about 3 cycles.

The explanation is simple — once the mixer scatters the writes:

Iteration i writes to buf[A], data enters the store buffer
Iteration i+1 reads buf[B] (A ≠ B), hitting L1d directly — because buf[B] is not on the store buffer's forwarding path, and the last time it was modified was many iterations ago; the store buffer has long since drained it into L1d
An L1d hit costs only 3–4 cycles, an order of magnitude cheaper than the dozen-plus cycles of byte→word forwarding

Note: the mixer is a bijection — it won't map two different keys to the same hash. It merely permutes the input space without introducing any collisions. Zero impact on correctness, all benefit for performance.

Real-World Application: Swiss Table Performance Optimization

This is not a contrived toy example. This benchmark captures a real bottleneck I encountered while implementing Swiss Table (the underlying algorithm of Go's built-in map).

The test scenario: sequentially inserting 1 million integer key-value pairs.

The initial version used std::hash<int> (which on aarch64 is equivalent to identity — i.e., hash(x) = x). Since consecutive keys produce consecutive hashes, the low 7 bits (h₂) cycle while the high bits (h₁) stay unchanged. This causes every 128 consecutive keys to land in the same "control word group" — exactly the situation in the demo where g stays fixed.

Result:

Inserting 1 million entries: 438 ms
std::unordered_map doing the same: 12 ms

After adding a mixer:

Inserting 1 million entries: 10 ms (beats std)
All benchmarks improved significantly

Scenario	Before	After	std baseline
Sequential insert	438ms	10ms	12ms
Random insert	226ms	9ms	35ms
Mixed operations	834ms	53ms	70ms
Dense iteration	5ms	19ms	12ms

Not a better algorithm. Not a redesigned storage layout. Just one thing — a 5-line bit mixer tacked onto the hash, completely eliminating the RAW store-to-load forwarding bottleneck triggered by consecutive keys.

Conclusion

Decades ago, people said "optimization is just finding a faster algorithm." But on modern superscalar, deeply pipelined CPUs with store buffers and out-of-order execution, what the hardware actually goes through when executing your program often matters more than big-O complexity.

What does a byte store look like under the hood? Does a load go through the store buffer or straight to L1d? Will consecutive iterations hitting the same address stall the out-of-order window? If you can answer these questions, a few lines of code can unlock hundredfold performance gains.

And that's exactly what the "Foundations of Low-Level Systems" series aims to achieve.

For a deeper dive into how the Store Buffer works, see the companion article Cache Series V — Write Policies, Store Buffer, and Memory Barriers, which explains from the hardware level how store buffers decouple from the pipeline, the forwarding paths of store-to-load forwarding, and the differences in memory ordering models across ISAs.