I stripped my code down to do absolutely nothing. Just count events and move on. It got 8% slower.
This isn't measurement noise. Over 30 seconds of processing 12+ million events across 5 test runs, the "optimized" version was consistently, measurably slower than the version doing expensive kernel symbol lookups and string formatting.
The Setup
I built an eBPF tool that monitors TCP packet drops in the Linux kernel. eBPF (Extended Berkeley Packet Filter) lets you hook into kernel functions without modifying kernel source. When a packet is dropped, my eBPF program captures the event (PID, drop reason, kernel function) and sends it to userspace through a ring buffer.
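For context, each drop event that crosses the ring buffer is a small fixed-size struct. A minimal Go-side sketch (the field types and layout here are my illustration, not necessarily the real definition):

```go
// monitorEvent mirrors the C struct the eBPF program submits to the ring
// buffer. Field order and sizes must match the kernel-side definition;
// the ones shown here are illustrative assumptions.
type monitorEvent struct {
    Pid      uint32 // process associated with the drop
    Reason   uint32 // kernel drop reason code
    Location uint64 // instruction pointer of the kernel function that dropped the packet
}
```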
During stress testing, I flooded my machine with SYN packets:
```bash
sudo hping3 -S -p 80 --flood
```
The kernel started dropping ~400,000 packets per second. My tool reads each drop event from the ring buffer and processes it.
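The userspace side is built on the cilium/ebpf ringbuf reader. A simplified sketch of that plumbing (the map and function names are placeholders for my loader code, not the exact source):

```go
import (
    "bytes"
    "encoding/binary"

    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/ringbuf"
)

// readLoop pops one record at a time from the BPF ring buffer map and hands
// the decoded event to whichever processing mode is active.
func readLoop(events *ebpf.Map, handle func(*monitorEvent)) error {
    rd, err := ringbuf.NewReader(events)
    if err != nil {
        return err
    }
    defer rd.Close()

    for {
        record, err := rd.Read() // blocks in epoll_wait when the buffer is empty
        if err != nil {
            return err
        }
        var event monitorEvent
        if err := binary.Read(bytes.NewReader(record.RawSample), binary.LittleEndian, &event); err != nil {
            continue // skip malformed samples
        }
        handle(&event)
    }
}
```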
I created four benchmark modes to find the bottleneck:
- Benchmark - Just count events. Zero processing.
- Busy - Do expensive work (symbol lookups, string formatting), then discard the result.
- File - Same expensive work, but write to a file.
- Terminal - Same expensive work, print to terminal.
Each mode ran for 30 seconds. Here's what happened:
| Mode | Events Read | Throughput | Rank |
|---|---|---|---|
| File | 12,548,707 | 373,828/sec | 1 |
| Busy | 12,432,234 | 370,316/sec | 2 |
| Benchmark | 11,570,132 | 344,616/sec | 3 |
| Terminal | 653,848 | 19,353/sec | 4 |
The mode doing zero work came in third.
The Paradox
Benchmark Mode should have won. It's literally just incrementing a counter:
```go
// Benchmark Mode - the "fast" path
for {
    record, err := ringbuf.Read()
    if err != nil {
        break
    }
    _ = record // payload is ignored; just count it
    metrics.EventsRead.Add(1)
}
```
No symbol lookups. No string formatting. No I/O. Just atomic increment and repeat.
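(For reference, metrics here is just a struct of lock-free counters; something along these lines, with the exact field set being my assumption:)

```go
import "sync/atomic"

// Metrics is shared between the read loop and the stats reporter.
type Metrics struct {
    EventsRead    atomic.Int64
    EventsPrinted atomic.Int64
}
```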
But it lost by 8% to Busy Mode, which does this:
```go
// Busy Mode
func (p *EventProcessor) ProcessEventBusy(event *monitorEvent) {
    p.metrics.EventsRead.Add(1)

    // Map lookup for drop reason
    reasonStr := p.dropReasons[event.Reason]
    if reasonStr == "" {
        reasonStr = fmt.Sprintf("UNKNOWN(%d)", event.Reason)
    }

    // Binary search through 300k kernel symbols
    symbolName := findNearestSymbol(event.Location)
    if symbolName == "" {
        symbolName = fmt.Sprintf("0x%x", event.Location)
    }

    // String formatting with allocations
    _ = fmt.Sprintf("[%s] Drop | PID: %-6d | Reason: %-18s | Function: %s\n",
        time.Now().Format("15:04:05"),
        event.Pid,
        reasonStr,
        symbolName)

    p.metrics.EventsPrinted.Add(1)
}
```
How does doing more work make code faster?
Ruling Out the Obvious
Memory Allocation?
Benchmark Mode:
- Total allocated: 303 MB
- GC runs: 14
Busy Mode:
- Total allocated: 2,592 MB (8x more)
- GC runs: 75 (5x more)
Busy Mode was doing 5x more garbage collection and still winning. The bottleneck wasn't memory.
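(Numbers like these are straightforward to collect from the Go runtime; a minimal sketch, not the exact instrumentation used here:)

```go
import "runtime"

// snapshotMem returns cumulative bytes allocated and completed GC cycles.
// Diffing a snapshot taken before and after a 30-second run yields the
// totals quoted above.
func snapshotMem() (totalAlloc uint64, numGC uint32) {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    return m.TotalAlloc, m.NumGC
}
```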
The Smoking Gun: CPU Profiling
I added CPU profiling with pprof to both modes. The difference was stark.
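Wiring a CPU profile into a run takes a few lines with the standard runtime/pprof package; roughly:

```go
import (
    "os"
    "runtime/pprof"
)

// profileCPU starts writing a CPU profile to path and returns a stop function
// to call when the 30-second run finishes.
func profileCPU(path string) (stop func(), err error) {
    f, err := os.Create(path)
    if err != nil {
        return nil, err
    }
    if err := pprof.StartCPUProfile(f); err != nil {
        f.Close()
        return nil, err
    }
    return func() {
        pprof.StopCPUProfile()
        f.Close()
    }, nil
}
```

The resulting profile can then be explored with go tool pprof.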
Benchmark Mode (344k events/sec)
```
unix.EpollWait:  11.12s (76%)  ← Blocking on syscalls
ringbuf.Read:    13.61s (93%)  ← Total time in read
```
76% of CPU time spent waiting in epoll_wait(), blocked while the kernel writes the next event.
Busy Mode (370k events/sec)
```
ProcessEventBusy:        11.27s (54%)  ← Actually doing work
├─ fmt.Sprintf:           6.53s (31%)
├─ findNearestSymbol:     2.61s (13%)
└─ mallocgc:              3.19s (15%)
unix.EpollWait:           5.57s (27%)  ← Much less waiting
```
Only 27% waiting. The rest was productive work.
File Mode (373k events/sec)
```
ProcessEvent:            11.66s (56%)  ← Work
├─ fmt.Fprintf:           4.95s (24%)  ← Cheaper than Sprintf!
├─ findNearestSymbol:     2.68s (13%)
└─ mallocgc:              2.64s (13%)
unix.EpollWait:           5.87s (28%)  ← Same batching as Busy
```
File and Busy have identical syscall overhead (28% waiting). File edges ahead because fmt.Fprintf() to a buffered file is more efficient than fmt.Sprintf() creating throwaway strings.
Why This Happens: The Syscall Tax
The kernel generates a drop event roughly every 2.5µs (~400,000/sec).
Benchmark Mode processes each event in ~1µs:
```
Event arrives → Process (1µs) → ringbuf.Read()
                                      ↓
                              Ring buffer empty!
                                      ↓
                     epoll_wait() blocks (context switch)
                                      ↓
                      Kernel writes next event (1.5µs)
                                      ↓
                             Wake up userspace
                                      ↓
                                   Repeat
```
Result: ~400,000 context switches per second, constant blocking.
Benchmark Mode was too fast. It kept asking for the next event before the kernel had written it, forcing the process to sleep in epoll_wait() waiting for data.
Busy Mode processes each event in ~4µs:
```
Event arrives → Process (4µs) → ringbuf.Read()
      ↑                               ↓
      |                   3 events already waiting!
      |                               ↓
      └────────── Kernel wrote more while we were busy
```
Result: ~150,000 context switches per second, natural batching.
By the time userspace calls Read(), multiple events are already queued in the ring buffer. No blocking needed.
Accidental Batching
Busy Mode's processing time (~4µs per event) accidentally created natural batching. While userspace was busy formatting strings and looking up symbols, the kernel queued multiple events. Each ringbuf.Read() call pulled several events without blocking.
Benchmark Mode outpaced its data source and spent most of its time context-switching between userspace and kernel, waiting for the next event.
The paradox: Removing work made the code too fast, causing it to waste time waiting.
The Real Cost of Performance
Here's what's actually expensive in event-driven programming:
Cheap:
- Memory allocation (even 8x more)
- Garbage collection (even 5x more)
- CPU work (symbol lookups, string formatting)
Expensive:
- Context switches (crossing the kernel/userspace boundary)
- Blocking syscalls (epoll_wait, read, poll)
When you optimize CPU work to near-zero, you don't eliminate the cost; you just shift it to I/O overhead. If your processing loop is faster than your data source, you end up paying the syscall tax on every single event.
Who Should Care?
This pattern appears in any high-frequency event processing:
- Network packet processing (eBPF, DPDK, raw sockets)
- Financial trading systems (market data feeds)
- Log aggregation (reading from message queues)
- Metrics collection (statsd, Prometheus exporters)
- Game engines (input event processing)
Rule of thumb: If your event processing is faster than your event arrival rate, you're probably paying unnecessary syscall overhead. Profile your code. If you see >50% time in epoll_wait/poll/select, you're thrashing on syscalls. Batch your reads.
The Actual Fix
The real lesson here isn't "batch your reads better in userspace." The cilium/ebpf library's ringbuf.Read() is already reasonably efficient. You're still bound by the poll/epoll cycle regardless.
The actual fix is to stop sending 400,000 events to userspace in the first place.
This is where eBPF's real power comes in: move the aggregation logic into the kernel.
Instead of:
```
Kernel:    Drop packet → Send event to ring buffer
Userspace: Read event → Process → Count
Result:    400,000 context switches/sec
```
The proper approach:
```
Kernel:    Drop packet → Increment counter in BPF map (no userspace trip!)
Userspace: Read aggregated counts once per second
Result:    1 context switch/sec
```
Using eBPF maps, I can aggregate packet drops directly in kernel memory, counting drops by kernel function, IP address, or drop reason, and pull just the summary to userspace periodically instead of streaming 400,000 individual events per second.
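On the userspace side, the hot read loop becomes a once-per-second map walk. A sketch of what that could look like with cilium/ebpf, assuming a plain BPF hash map (here called dropCounts) keyed by a u32 drop reason with a u64 counter value; the names are placeholders, not the finished tool:

```go
import (
    "log"
    "time"

    "github.com/cilium/ebpf"
)

// pollDropCounts reads the kernel-side aggregation once per second instead of
// consuming one ring buffer event per dropped packet.
func pollDropCounts(dropCounts *ebpf.Map) {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()

    for range ticker.C {
        var (
            reason uint32
            count  uint64
        )
        iter := dropCounts.Iterate()
        for iter.Next(&reason, &count) {
            log.Printf("reason=%d drops=%d", reason, count)
        }
        if err := iter.Err(); err != nil {
            log.Printf("iterating drop counts: %v", err)
        }
    }
}
```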
Expected improvement: 99.9% reduction in context switches (from 400k/sec to ~1/sec).
This is future work for me. The benchmark results were educational: they taught me about syscall overhead and accidental batching, but they also revealed I was solving the problem in the wrong place.
The Takeaway
The fastest code isn't always the code doing the least work; it's the code that minimizes expensive operations.
In this case:
- Benchmark Mode: Optimized CPU work, paid 76% overhead in syscalls
- Busy Mode: Did 8x more allocation and 5x more GC, reduced syscall overhead to 27%
- File Mode: Most efficient I/O primitives, same batching benefits
The real enemy wasn't symbol lookups or string formatting. It was calling the kernel 400,000 times per second instead of 150,000 times.
Three lessons:
- Profile before optimizing - My intuition said "remove work." The profiler said "syscalls are the bottleneck."
- Batching beats speed - Reading 10 events with 1 syscall is faster than reading 10 events with 10 syscalls.
- There's a sweet spot - Too-fast processing just means waiting for I/O.
The bottleneck is rarely where you think it is.
Environment: Ubuntu 24.04, Linux 6.5, Go 1.21 (variance <2% across 5 runs)
Full code: GitHub
Profiling implementation: benchmark/investigation branch
Previous article: My Logs Lied: How I Used eBPF to Find the Truth




