Dmitriy

A year-long hunt for a Linux kernel bug, or the unexpected zeros from XFS

You’ve probably had this happen too: a service runs smoothly, keeps users happy with its stability and performance, and your monitoring stays reassuringly green. Then, the next moment — boom, it’s gone. You panic, dive into the error logs, and find either a vague segfault or nothing at all. What to do is unclear, and production needs saving, so you bring it back up — and everything works just like before. You try to investigate what happened, but over time you switch to other tasks, and the incident fades into the background or gets forgotten entirely.

That’s all fine when you’re on your own. But once you have many customers, sooner or later you start feeling that something isn’t right, and that you need to dig into these spikes of entropy to find the root cause of such incidents.

This article describes our year-long investigation. You’ll learn why PostgreSQL (and any other application) can crash because of a bug in the Linux kernel, what XFS has to do with it, and why memory reclamation might not be as helpful as you thought.

How it started

Of course we cannot share all the internal details because of strict NDAs, but trust us: the crashes that started this story made plenty of noise.

We would occasionally get reports from clients (and not only them): "It worked, then crashed for no visible reason. We brought it back up, it keeps working, but we want to understand what that was." The reports were neither massive nor frequent, just occasional. Nobody really understood what broke, because there were no obvious causes. That went on until enough of these reports piled up and it became clear we had to solve this mystery.

So what did we have at time zero after a typical report from not-so-happy customers?

  • Clusters (and standalone installs) of different PostgreSQL versions, vanilla and not, crashing loudly: some with segfaults, some with just a vanished process
  • Servers running different Linux kernel versions
  • A "zoo" of installed software with very different settings, fields, and flags, including kernel parameters
  • Logs with errors about invalid memory addresses
  • Attempts to jump into protected memory segments
  • Attempts to execute illegal instructions
  • Crashes at random moments without obvious causes you could systematize. You could almost start looking for correlations with moon phases or planetary alignment.

There was one more important commonality: dump analysis and diagnostics pointed to anything except PostgreSQL. And while we are database developers, professional curiosity won, so we decided to go all the way.

Looking for clues

So we had a classic heisenbug, impossible to catch or reproduce in any obvious way. The worst part: real production systems were affected, not lab test rigs. We had to do a deep, exhaustive analysis of everything, testing even the wildest hypotheses.

The starting point was an observation that in some cases the system journal showed fragments of a write-protected code segment that contained only zeros instead of real instructions. This indirectly pointed to hardware or OS-level problems: we were reading what should be valid data from memory and getting unexpected zeros.

As we chased this lead, slowly but surely we pieced together the full instruction sequence that led to the error. It helped that similar errors appeared in the logs of other applications that also crashed on the same systems. They were just less critical than the database, so either nobody noticed the crashes or a supervising service auto-restarted them.

On the other hand, the problem appeared on servers with very different configurations, so we shifted focus from hardware issues to OS behavior during crashes. That meant fun things like digging into memory management code and driver analysis. And it was not just hard; we ran into another factor that limited our analysis. Until that moment all diagnostics were collected at the user-space process level, and the bug did not reproduce frequently enough to quickly get fresh data for each new hypothesis. Also, collecting detailed kernel-level info is not enabled on every production system. Although, if you look at it through an infrastructure admin's eyes, these constraints are not that bad.

Further investigation revealed another pattern: the higher the memory pressure, the higher the probability of the bug manifesting and crashing the system. For example, running several large memory tests in parallel increased the probability dramatically. With a partner who owns a large testing hardware pool, we were able to create the first scripts to reproduce the situation artificially.

So what actually happened?

We will talk about concrete reproduction scenarios below, but first let's understand what happened. We confirmed that the root cause is reading incorrect zeroed data from memory pages that map files (mmap). It does not matter whether we read data or execute code. So the bug was indeed at the OS kernel level.

Further research showed that the file mapped into memory must reside on the XFS file system, and that memory-page clearing on free must be enabled. Finally, the memory subsystem must be under heavy pressure, so that the reclaim mechanism actively frees pages.

mmap is a mechanism for mapping files into memory. Its story started in BSD systems in the 1980s, and it was implemented in the Linux kernel in the early 1990s, so this is core functionality, not something new. That is why mmap is used heavily both in the kernel and in applications. Even if our application does not call it explicitly, it can still hit the bug, because the OS uses mmap to load executable code into memory. If executable code or a dynamic library (even plain libc) resides on the affected filesystem, then at the moment of reproduction the CPU can read zeros instead of real instructions, crashing the application.
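
Even a process that never calls mmap itself has its executable and shared libraries mapped from files, which is easy to see in /proc/<pid>/maps. A quick illustration (here the reading process is grep itself; any PID works):

~# grep 'r-xp' /proc/self/maps

Every path listed with executable permissions is code the CPU fetches from a file-backed mapping; if any of those files live on an XFS volume, the process is exposed.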

We tested different hypotheses and found the issue reproduces with different mapping parameters: Read-Only and Read-Write, Shared and Private. In short, the full basket of happiness.

Reproduction also requires the mechanism that zeroes freed memory pages to be enabled. Normally, physical pages are cleared when they are allocated to an application; if they are merely freed, their content is preserved. Zeroing on free is typically required on systems with strict data-protection requirements, which regulators around the world love. Starting with kernel version 5.3, the init_on_free parameter controls this behavior, and its default is set by the kernel build option CONFIG_INIT_ON_FREE_DEFAULT_ON. Pro tip: if you do not have bootloader access, you can confirm the mechanism is active by searching the logs for mem auto-init:

mem auto-init: stack:off, heap alloc:off, heap free:on
mem auto-init: clearing system memory may take some time…
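
A direct check is to grep the boot log and the kernel command line (keep in mind that init_on_free can also be enabled purely by the build default, so its absence from the command line proves nothing):

~# dmesg | grep "mem auto-init"
~# grep -o "init_on_free=[01]" /proc/cmdline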

Now about the XFS mentioned in the title and why this starts with the filesystem. It is not simple: we could not find a single definitive cause. The most likely reason the bug triggers when files are on XFS is the folio infrastructure (struct folio), the successor of struct page. If you have not run into it, it is a memory-management concept designed to simplify work with huge pages and compound pages. In the affected kernel versions it was not yet supported by other filesystems, and the commits below are tied to the folio implementation.

Finally, the last factor that leads to the crash is heavy memory subsystem load. When the system actively allocates free memory, cached pages can be evicted. We reproduced this state by allocating memory in small blocks until free space was exhausted. After that it does not matter whether the OOM Killer triggers or the process is simply stopped. This proved that the OOM Killer itself is not to blame.
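
If you want to confirm that reclaim actually kicks in while you push the system into this state, the kernel's own counters are enough (a quick check; counter names can differ slightly between kernel versions):

~# grep -E "pgscan|pgsteal" /proc/vmstat

Watching these values grow during the test shows that pages really are being scanned and reclaimed, not merely allocated.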

So yes, the bug is complex and OS behavior is unpredictable. Any application can crash when it tries to modify a protected memory area or execute code from a segment without execute permission.

Our path to reproduction

Because the bug was unstable, we decided to focus on one synthetic scenario that would test the maximum number of hypotheses with an unambiguous result. The test intentionally includes logic that leads to process termination due to memory exhaustion, but it is not guaranteed that the bug only appears under these conditions.

Our test rig is Debian 11/12 with gcc and xfsprogs installed. The VM gets 4 CPU cores, 4 GB of RAM, and 20 GB of disk. The system partition is formatted as ext4. The bootloader is grub with the options init_on_free=1 and transparent_hugepage=never (the latter to exclude the impact of THP).

~# grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet init_on_free=1 transparent_hugepage=never"
~# update-grub
<restart>
~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.1.114 root=UUID=56af771d-b0a2-45af-bc53-66f4ca792577 ro quiet init_on_free=1 transparent_hugepage=never

For the test we mount a 384 MB XFS test partition on top of a file on the system partition:

~# dd if=/dev/zero of=xfs.file bs=1M count=384
~# mkfs.xfs -f xfs.file
~# mkdir xfs.mnt
~# mount -t xfs xfs.file xfs.mnt

We decided to reproduce the bug in two ways:

  1. Show that an application can get incorrect data from a memory-mapped file.
  2. Get a segfault in a correct application due to the CPU reading incorrect values from the code segment.

In both cases the trigger is running a test program in parallel that continuously allocates memory in 1 KB chunks until the OOM Killer kills it.

The code is as straightforward as a railway sleeper:

#include <stdlib.h>
#include <stdio.h>

/* Allocate memory in 1 KB chunks forever; in practice the OOM Killer ends the process. */
int main(int argc, char* argv[]) {
    for (;;) {
        if (malloc(1024) == NULL) {
            printf("Could not allocate memory\n");
            return 1;
        }
    }
}


We also need to map a file located on XFS into memory. We prepare a 100*4096-byte sample because it is convenient:

~# dd if=/dev/zero bs=4096 count=100 | tr '\0' '\1' > xfs.mnt/test_file

And a simple program that maps it into memory with this algorithm: in a loop, read the first byte of each 4 KB page and check that it still equals 1.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char* argv[]) {
    char* filename = "xfs.mnt/test_file";
    int fd = open(filename, O_RDONLY);
    if (fd == -1) {
        printf("Could not open file %s\n", filename);
        exit(EXIT_FAILURE);
    }
    /* Map 100 pages of the file read-only and private. */
    char* map = mmap(NULL, 4096 * 100, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        printf("Could not map file content %s\n", filename);
        exit(EXIT_FAILURE);
    }

    printf("Starting test...\n");
    /* Test body: the first byte of every mapped page must still hold the value 1. */
    for (;;) {
        for (int i = 0; i < 100; i++) {
            char c = map[i * 4096];
            if (c != 1) {
                printf("Got invalid value on page %i = %i\n", i, c);
            }
        }
    }

    /* Unreachable: the loop above runs until the test is stopped by hand. */
    munmap(map, 4096 * 100);
    close(fd);
    return EXIT_SUCCESS;
}
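
To start the experiment we compile both programs and run them side by side; the binary names below are our own, and the mapping test must be launched from the directory that contains xfs.mnt:

~# gcc -o alloc_pressure alloc_pressure.c
~# gcc -o map_check map_check.c
~# ./map_check &
~# ./alloc_pressure

Keeping map_check in the background holds the mapping alive while alloc_pressure pushes the system into reclaim; restart alloc_pressure every time the OOM Killer takes it out.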

Now we run all the programs until the test starts reporting zeros instead of ones. That means the primary goal is achieved, and the next task is to get a segfault. For that we write one more test that repeatedly calls trivial "empty" functions and checks their return values:

#include <stdio.h>
#include <stdlib.h>

/* Define a trivial function that returns a constant. */
#define FUNC_HERE(FUNC_NAME, RET_VAL) int FUNC_NAME() {\
    return RET_VAL;\
}

/* Call the function and verify that the constant came back intact. */
#define CHECK_FUNC(FUNC_NAME, RET_VAL) { \
    int check_val = FUNC_NAME();\
    if (check_val != RET_VAL) {\
        printf("Erroneous value from " #FUNC_NAME ", expected %d, received %d\n", RET_VAL, check_val);\
        exit(1);\
    }\
}

FUNC_HERE(func0,   0)
FUNC_HERE(func1,   1)
FUNC_HERE(func2,   2)
FUNC_HERE(func3,   3)
FUNC_HERE(func4,   4)
FUNC_HERE(func5,   5)
FUNC_HERE(func6,   6)
FUNC_HERE(func7,   7)
FUNC_HERE(func8,   8)
FUNC_HERE(func9,   9)
FUNC_HERE(func10, 10)
FUNC_HERE(func11, 11)
FUNC_HERE(func12, 12)
FUNC_HERE(func13, 13)
FUNC_HERE(func14, 14)
FUNC_HERE(func15, 15)
FUNC_HERE(func16, 16)
FUNC_HERE(func17, 17)
FUNC_HERE(func18, 18)
FUNC_HERE(func19, 19)
FUNC_HERE(func20, 20)
FUNC_HERE(func21, 21)
FUNC_HERE(func22, 22)
FUNC_HERE(func23, 23)
FUNC_HERE(func24, 24)
FUNC_HERE(func25, 25)
FUNC_HERE(func26, 26)
FUNC_HERE(func27, 27)
FUNC_HERE(func28, 28)
FUNC_HERE(func29, 29)

int main(int argc, char* argv[]) {
    printf("Program start\n");

    for (;;) {
        // Check functions
        CHECK_FUNC(func0,   0)
        CHECK_FUNC(func1,   1)
        CHECK_FUNC(func2,   2)
        CHECK_FUNC(func3,   3)
        CHECK_FUNC(func4,   4)
        CHECK_FUNC(func5,   5)
        CHECK_FUNC(func6,   6)
        CHECK_FUNC(func7,   7)
        CHECK_FUNC(func8,   8)
        CHECK_FUNC(func9,   9)
        CHECK_FUNC(func10, 10)
        CHECK_FUNC(func11, 11)
        CHECK_FUNC(func12, 12)
        CHECK_FUNC(func13, 13)
        CHECK_FUNC(func14, 14)
        CHECK_FUNC(func15, 15)
        CHECK_FUNC(func16, 16)
        CHECK_FUNC(func17, 17)
        CHECK_FUNC(func18, 18)
        CHECK_FUNC(func19, 19)
        CHECK_FUNC(func20, 20)
        CHECK_FUNC(func21, 21)
        CHECK_FUNC(func22, 22)
        CHECK_FUNC(func23, 23)
        CHECK_FUNC(func24, 24)
        CHECK_FUNC(func25, 25)
        CHECK_FUNC(func26, 26)
        CHECK_FUNC(func27, 27)
        CHECK_FUNC(func28, 28)
        CHECK_FUNC(func29, 29)
    }

    printf("Program end\n");
}
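
A possible build and run line, with the binary placed on the XFS mount (test2 is our name, chosen to match the dmesg output below):

~# gcc -fno-inline -falign-functions=4096 -o xfs.mnt/test2 test2.c
~# xfs.mnt/test2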

The -fno-inline -falign-functions=4096 flags ensure that functions are not inlined and each one starts on its own 4096-byte page, so every function sits on its own memory page. The resulting binary is placed on the XFS volume where the source files live. If our theory is right, after a few runs we get a segfault, and the system journal shows an entry like:

~# dmesg
...
[ 1160.680986] test2[590]: segfault at 0 ip 00005645b312e000 sp 00007ffdb8a95758 error 6 in test2[5645b312b000+21000] likely on CPU 1 (core 1, socket 0)
[ 1160.681007] Code: Unable to access opcode bytes at 0x5645b312dfd6.
...

If everything is done correctly, the result arrives quickly, and you can go drink champagne (or take something for the insomnia after what you have just seen). You have just confirmed a Linux kernel bug that leads to data loss.

Your tests are nice, but how do I diagnose my system?

A very valid question! The problem can surface anywhere in code, symptoms can be almost anything, and deep memory-dump analysis is not a skill every engineer has on tap. Also, to be honest, not everyone has sufficient logging enabled. But don't despair: there is a set of typical symptoms that can make life easier.

First, dmesg will likely show a clear segfault:

~# dmesg
...
[ 1160.680986] test2[590]: segfault at 0 ip 00005645b312e000 sp 00007ffdb8a95758 error 6 in test2[5645b312b000+21000] likely on CPU 1 (core 1, socket 0)
[ 1160.681007] Code: Unable to access opcode bytes at 0x5645b312dfd6.
...

In some cases the error may not be a segmentation error but an attempt to execute an invalid instruction — invalid opcode.

traps: postgres[1923471] trap invalid opcode ip:7f71444c0003 sp:7ffcb3ec4320 error:0 in liblz4.so.1.9.3[7f71444b7000+1b000]

Another characteristic sign is zero values in the code segment after a segfault:

[<timestamp>] postgres[24237]: segfault at cb1 ip 00005579a1505130 sp 00007ffe4f32d008 error 6 in postgres[5579a1170000+559000] likely on CPU 49 (core 17, socket 1)
[<date>] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Remember, though, that it does not always fire this way: the process may even exit with a perfectly normal-looking code. So do not make sudden, rash decisions.

How bad is this globally?

Our checks showed that the bug exists starting from Linux kernel 5.18 and stops reproducing from 6.9. That hints at a fairly wide spread.

The point of breakage led us to the commit "mm/readahead: Switch to page_cache_ra_order". And the problem went away around a patch series discussed for 6.9, specifically after the commit "mm: free folios in a batch in shrink_folio_list()".

Summary

Let's wrap up with a classic self-checklist and a few useful tips to sleep a little calmer (a quick set of commands to verify each item follows the list):

  • Affected kernels are Linux 5.18 to 6.8
  • The issue manifests if init_on_free is enabled in the kernel
  • The issue manifests if the filesystem is XFS
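
Running through the checklist on a live host could look like this (a sketch; dmesg usually needs root, and your vendor kernel may carry a backported fix even on an "affected" version):

~# uname -r
~# dmesg | grep "mem auto-init"
~# findmnt -t xfs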

What if you have all three? The corporate answer is "contact your OS vendor," because realistically only they can help you. It is possible your vendor already fixed it in your specific kernel and you do not need to do anything.

If you are not so lucky, or while you wait for a response or a patch, one option is to migrate to another filesystem (hello, ext4) or to set init_on_free=0 in the Linux boot arguments. But of course, do that at your own risk.
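
If you go the init_on_free=0 route, the change mirrors the grub setup from the reproduction section (a sketch; keep your other parameters, and check your security policy before disabling page clearing on free):

~# grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet init_on_free=0"
~# update-grub
<restart>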
