
Linux Pipes are Slow

Errata: Some significant mistakes were pointed out to me by email by Brendan MacDonell. I have included errata, but the results might not be reliable, so take this with a pinch of salt!

vmsplice is too fast

Some programs use a particular system call “vmsplice” to move data faster through a pipe. Francesco already did a deep dive on using vmsplice to make things fast. However, while experimenting with it, I noticed that, when not using vmsplice, Linux pipes are slower than what I would have expected. Since you cannot always use it, I wanted to understand exactly why that was, and whether it could be improved.

The reason I want to move data through pipes is that I am writing a program to encode/decode Morse code blazingly fast.

To get a point of reference, the obvious candidate is the Fizz Buzz throughput competition at the Code Golf StackExchange. There are two kinds of solutions:

  1. the ones that manage to reach up to a few gigabytes per second, with neil’s reaching 8.4 GiB/s;
  2. the ones which largely surpass that, from tkluck’s at 15.5 GiB/s to ais523’s at 60.8 GiB/s, to david’s at 208.3 GiB/s using multiple cores.

The difference between the first and the second group is that the second is using vmsplice, while the first is not[1]. But how can using vmsplice enable such a large gain in performance? My intuition about vmsplice is that it allows you to avoid copying data to and from kernel space. Surely, copying data cannot be slower than generating it? Even assuming it is not faster, and that you have to copy the data twice to get it through the pipe, you would expect a throughput gain of 3×, at best. But here, we have a factor of 7, even just looking at single-core solutions.
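To make this back-of-the-envelope bound explicit, here is a small sketch. It assumes, as in the paragraph above, that generating the data and each of the two copies cost the same time per byte. The function names are made up for illustration, and the throughput figures are the ones quoted from the competition.

```rust
/// Upper bound on the speed-up obtained by removing `copies` extra passes
/// over the data, assuming each pass costs as much as generating the data.
fn max_gain_from_removing_copies(copies: u32) -> f64 {
    // Before: 1 generation pass + `copies` copy passes. After: 1 pass.
    (1 + copies) as f64
}

/// Ratio between two throughputs expressed in the same unit.
fn observed_gain(fast: f64, slow: f64) -> f64 {
    fast / slow
}

fn main() {
    let bound = max_gain_from_removing_copies(2);
    // ais523's solution vs neil's solution, both in GiB/s.
    let seen = observed_gain(60.8, 8.4);
    println!("expected at most {:.1}x, observed {:.1}x", bound, seen);
}
```

The observed ratio comes out well above the bound, which is the discrepancy the rest of the article tries to explain.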

Something is missing in my mental model, I want to know what.

First, I’ll need to perform my own measurements to easily compare with what I’ll do afterward. Compiling and running ais523’s solution on my computer[2], I get:

$ ./fizzbuzz | pv >/dev/null
96.4GiB 0:00:01 [96.4GiB/s]

With david’s solution, I reach 277 GB/s when using 7 cores (40 GB/s per core).

Now, to understand what’s going on, we need to find the answer to these questions:

  1. How fast can we write data ideally?
  2. How fast can we actually write data to a pipe?
  3. How does vmsplice help?

Writing Data in the Ideal Wonderland

First, let’s consider the program below, which just copies data without doing any system call. I use std::hint::black_box to stop the compiler from noticing that we are not using the result. Without this, the compiler would optimize the program to nothing.

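fn main() {
    // Two 32 KiB buffers: destination and source.
    let dst = [0u8; 1 << 15];
    let src = [0u8; 1 << 15];
    let mut copied = 0;
    // Copy 1000 GiB in total, 32 KiB at a time.
    while copied < (1000 << 30) {
        // black_box keeps the compiler from optimizing the copy away.
        std::hint::black_box(dst).copy_from_slice(&src);
        copied += src.len();
    }
}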

On my system, this runs at 167 GB/s. This is consistent with the speed of writing to L1 cache for my CPU[3].
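The article does not show how the throughput figure was obtained. One possible way to get a comparable number is to time the loop with std::time::Instant, as in the sketch below. The function name measure_copy_gbps and the reduced total size are assumptions of this sketch, and it uses a plain slice copy, so the compiler may pick a different routine than in the program above.

```rust
use std::time::Instant;

/// Copies `total_bytes` in chunks of `buf_len` bytes between two buffers and
/// returns the observed throughput in GB/s (10^9 bytes per second).
fn measure_copy_gbps(buf_len: usize, total_bytes: u64) -> f64 {
    let src = vec![0u8; buf_len];
    let mut dst = vec![0u8; buf_len];
    let mut copied: u64 = 0;
    let start = Instant::now();
    while copied < total_bytes {
        // black_box keeps the optimizer from removing the copy.
        std::hint::black_box(&mut dst).copy_from_slice(std::hint::black_box(&src));
        copied += buf_len as u64;
    }
    let secs = start.elapsed().as_secs_f64();
    copied as f64 / secs / 1e9
}

fn main() {
    // 32 KiB buffers as in the program above, but only 16 GiB in total.
    let gbps = measure_copy_gbps(1 << 15, 16 << 30);
    println!("{:.1} GB/s", gbps);
}
```

Build it with optimizations (cargo build --release) and pin it to one core, otherwise the number says little about the hardware.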

When profiling this with ftrace, we see that 99.9% of the time is spent in __memset_avx512_unaligned_erms[4], directly called by main, and calling no other functions. The flamegraph is pretty much flat. If you do not feel like running a full-fledged profiler, you can just use gdb and hit Ctrl+C at a random time:

$ cargo build --release
$ gdb target/release/copy 
…
(gdb) run
…
^C (hitting Ctrl+C)
Program received signal SIGINT, Interrupt.
__memset_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:236
…
=> 0x00007ffff7f15dba    f3 aa    rep stos %al,%es:(%rdi)

In any case, note that we are using AVX-512. The implementation is in a generic file dedicated to SIMD vectorization that supports SSE, AVX2 and AVX-512. In our case, the AVX-512 specialization is used.

Errata: The instruction we stop on is rep stos. While faster than an assembly loop, this is definitely not an AVX-512 instruction. I do not know how neither I nor Hacker News noticed it.

As an aside, note that the implementation of memcpy in glibc uses vm_copy, a kernel feature, to copy pages directly on Mach-based systems (mostly Apple products).

However, AVX-512 is quite niche. According to Steam’s hardware survey (section “Other Settings”), only about 12% of Steam users have it. In fact, Intel only included AVX-512 for consumer-grade processors in the 11th generation, and now reserves it for servers. AMD CPUs have supported AVX-512 since the Ryzen 7000 series (Zen 4).
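If you want to know which of these instruction sets your own CPU reports, the Rust standard library can tell you at run time. The sketch below checks only three features (sse2, avx2 and avx512f, the AVX-512 foundation subset). The function name is made up for this example, and on non-x86-64 targets it simply returns an empty list.

```rust
/// Returns the subset of {sse2, avx2, avx512f} reported by the running CPU.
/// On targets other than x86-64 the list is empty.
fn detected_simd_features() -> Vec<&'static str> {
    let mut found = Vec::new();
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("sse2") {
            found.push("sse2");
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            found.push("avx2");
        }
        if std::arch::is_x86_feature_detected!("avx512f") {
            found.push("avx512f");
        }
    }
    found
}

fn main() {
    println!("{:?}", detected_simd_features());
}
```

Note that this reflects what the kernel exposes through cpuid, so it is also a quick way to check that a clearcpuid option had the intended effect.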

So I tested this same program while disabling AVX-512. For this, I used the Linux kernel option clearcpuid=304. I was able to check that it used __memset_avx2_unaligned_erms using the gdb and Ctrl+C trick. I then did the same to disable AVX2 with clearcpuid=304,avx2,avx, making it use __memset_sse2_unaligned_erms.

Although SSE2 is always available on x86-64, I also disabled the cpuid bit for SSE2 and SSE to see if it could nudge glibc into using scalar registers to copy data. I immediately got a kernel panic. Ah, well.

When using AVX2, the throughput was… 167 GB/s. When using only SSE2, the throughput was… still 167 GB/s. To an extent, it makes sense: even SSE2 is quite enough to fully use the bus and saturate L1 bandwidth. Using wider registers only helps when performing ALU operations.

The conclusion from this experiment is that, as long as vectorization is used, I should reach 167 GB/s.

Errata: The fact that the throughput stays at 167 GB/s is simply due to rep stos being used in all cases.

Actually Writing Data to a Pipe

So, let’s look at what happens when we write to a pipe instead of to user space memory:

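use std::io::Write;
use std::os::fd::FromRawFd;

fn main() {
    // 32 KiB of zeroes, written over and over.
    let vec = vec![b'\0'; 1 << 15];
    let mut total_written = 0;
    // Wrap file descriptor 1 (stdout) in a File, so that each call to
    // write() maps to one write system call.
    let mut stdout = unsafe { std::fs::File::from_raw_fd(1) };
    // Stop after 100 GiB, or as soon as a write fails.
    while let Ok(n) = stdout.write(&vec) {
        total_written += n;
        if total_written >= (100 << 30) {
            break;
        }
    }
}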

We then measure the throughput using:

cargo run --release | pv >/dev/null

On my computer, this reaches 17 GB/s. This is 10 times as slow as just writing to a buffer! How can a system call which basically writes to a kernel buffer be so much slower? And no, context switches don’t take that much time.

So let’s do some profiling of this program.

[Flame graph: profiling of ./zeroes, run as ./zeroes | pv >/dev/null. Widest frames: pipe_write (95.48%), __alloc_pages (36.29%), __mutex_lock.constprop.0 (25.29%), copy_user_enhanced_fast_string (20.09%), _raw_spin_lock_irq (5.19%).]

Note that __GI___libc_write is the glibc wrapper that performs the system call. It and everything below is in user land. Everything above is in the kernel.

As expected, we are spending virtually all our time calling write. In particular, we are spending 95% of our time inside pipe_write. Inside this function, we are spending 36% of our total time in __alloc_pages, which provisions new memory pages for the pipe. We cannot just reuse a handful of pages in a loop because pv moves these pages to /dev/null using splice, which consumes them.
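To give an idea of what the reading side is doing, here is a minimal sketch of draining a pipe into /dev/null with the splice system call. It is not pv's code: the helper names are invented for this example, the two libc functions are declared by hand following pipe(2) and splice(2), and it assumes Linux and a Rust toolchain recent enough to accept unsafe extern blocks.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::os::fd::{AsRawFd, FromRawFd};

// Hand-written declarations of two libc functions; see pipe(2) and splice(2).
unsafe extern "C" {
    fn pipe(fds: *mut i32) -> i32;
    fn splice(fd_in: i32, off_in: *mut i64, fd_out: i32, off_out: *mut i64,
              len: usize, flags: u32) -> isize;
}

/// Moves up to `len` bytes from the pipe `pipe_fd` to /dev/null.
fn splice_to_dev_null(pipe_fd: i32, len: usize) -> io::Result<usize> {
    let null = OpenOptions::new().write(true).open("/dev/null")?;
    let n = unsafe {
        splice(pipe_fd, std::ptr::null_mut(), null.as_raw_fd(),
               std::ptr::null_mut(), len, 0)
    };
    if n < 0 { Err(io::Error::last_os_error()) } else { Ok(n as usize) }
}

/// Writes `len` zero bytes into a fresh pipe, then splices them away.
/// `len` should fit in the pipe, otherwise the write would block.
fn write_then_drain(len: usize) -> io::Result<usize> {
    let mut fds = [0i32; 2];
    if unsafe { pipe(fds.as_mut_ptr()) } != 0 {
        return Err(io::Error::last_os_error());
    }
    // Wrapping the descriptors in File closes them when they go out of scope.
    let reader = unsafe { File::from_raw_fd(fds[0]) };
    let mut writer = unsafe { File::from_raw_fd(fds[1]) };
    writer.write_all(&vec![0u8; len])?;
    splice_to_dev_null(reader.as_raw_fd(), len)
}

fn main() -> io::Result<()> {
    println!("drained {} bytes", write_then_drain(4096)?);
    Ok(())
}
```

Once the bytes have been spliced away, the pipe no longer holds them, so the writer needs fresh pages for its next write.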

Next to __alloc_pages are __mutex_lock.constprop.0, which takes 25% of the time, and _raw_spin_lock_irq, which takes 5%. They lock the pipe for writing.

This leaves just 20% of the time for the copying of data itself in copy_user_enhanced_fast_string. But, even with only 20% of the CPU time, we would expect to be able to move 167 GB/s * 20% = 33 GB/s. It means that, even taken separately, this function is still twice as slow as __memset_avx512_unaligned_erms, which was used in the program that just wrote to user space memory.
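The same comparison can be read the other way round: if the copy only runs for 20% of the time while the whole program moves 17 GB/s, then the copy itself is effectively running at 17 / 0.20 = 85 GB/s, about half of the 167 GB/s reference. The sketch below just spells out that arithmetic with the figures from this article. The function name is made up for illustration.

```rust
/// Throughput a routine sustains on its own if it only runs for `fraction`
/// of the wall-clock time while the whole program achieves `overall_gbps`.
fn implied_throughput(overall_gbps: f64, fraction: f64) -> f64 {
    overall_gbps / fraction
}

fn main() {
    let ideal = 167.0; // GB/s, user-space copy measured earlier
    let pipe = 17.0; // GB/s, write() into the pipe
    let copy_share = 0.20; // share of time in copy_user_enhanced_fast_string

    // What the pipe would achieve if the copy ran at the ideal speed.
    println!("budget: {:.0} GB/s", ideal * copy_share);
    // Speed at which the kernel copy is effectively running.
    println!("implied copy speed: {:.0} GB/s", implied_throughput(pipe, copy_share));
}
```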

What is copy_user_enhanced_fast_string doing to be so slow? We need to dig deeper. For this, I disassembled my Linux kernel[5], and looked at that function.

$ grep -w copy_user_enhanced_fast_string /usr/lib/debug/boot/System.map-6.1.0-18-amd64 
ffffffff819d3d90 T copy_user_enhanced_fast_string
$ objdump -d --start-address=0xffffffff819d3d90 vmlinuz | less   
    
vmlinuz:     file format elf64-x86-64


Disassembly of section .text:

ffffffff819d3d90 <.text+0x9d3d90>:

ffffffff819d3d90:       90                      nop
ffffffff819d3d91:       90                      nop
ffffffff819d3d92:       90                      nop
ffffffff819d3d93:       83 fa 40                cmp    $0x40,%edx
ffffffff819d3d96:       72 48                   jb     0xffffffff819d3de0
ffffffff819d3d98:       89 d1                   mov    %edx,%ecx
ffffffff819d3d9a:       f3 a4                   rep movsb %ds:(%rsi),%es:(%rdi)
ffffffff819d3d9c:       31 c0                   xor    %eax,%eax
ffffffff819d3d9e:       90                      nop
ffffffff819d3d9f:       90                      nop
ffffffff819d3da0:       90                      nop
ffffffff819d3da1:       e9 9a dd 42 00          jmp    0xffffffff81e01b40
...
ffffffff81e01b40:       c3                      ret

The NOP instructions at the beginning and at the end of the function allow ftrace to insert tracing instructions when needed. This lets it collect data about specific kernel function calls without inducing any slowdown for kernel functions that are not being profiled. The CPU instruction decoding pipeline takes care of NOP early, so they have basically no impact on performance (other than taking room in the L1i cache).

I do not know why the JMP is not just a RET, however[6].

In any case, the CMP test and JB jump handle the case of buffers that are smaller than 64 bytes by jumping to another function that copies 8 bytes at a time with 64-bit registers, then 1 byte at a time with 8-bit registers, in two loops. For large buffers, the copying is handled by a REP MOVS instruction. That’s definitely not vectorized code.
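The control flow described above can be mimicked in a few lines of safe Rust. This is only an illustration of the structure (threshold at 64 bytes, then an 8-byte loop, then a 1-byte loop). It is not the kernel's code, and the function name is invented for this sketch.

```rust
/// Illustration only: same structure as the copy routine described above.
fn copy_like_kernel(dst: &mut [u8], src: &[u8]) {
    assert_eq!(dst.len(), src.len());
    if src.len() >= 64 {
        // Large buffers: the kernel issues a single `rep movsb` here.
        dst.copy_from_slice(src);
        return;
    }
    // Small buffers: first 8 bytes at a time (one 64-bit register)...
    let words = src.len() / 8;
    for i in 0..words {
        let start = i * 8;
        let mut word = [0u8; 8];
        word.copy_from_slice(&src[start..start + 8]);
        dst[start..start + 8].copy_from_slice(&word);
    }
    // ...then the remaining bytes one at a time.
    for i in words * 8..src.len() {
        dst[i] = src[i];
    }
}

fn main() {
    let src: Vec<u8> = (0..61).collect();
    let mut dst = vec![0u8; src.len()];
    copy_like_kernel(&mut dst, &src);
    assert_eq!(dst, src);
    println!("copied {} bytes", dst.len());
}
```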

In fact, this function is not implemented in C but directly in Assembly! This means that there is no need to look at the result of compilation; we can just look at the source code. And it’s not just a missed optimization when compiling, it was written like that.

But is the lack of vector instruction the only reason why copy_user_enhanced_fast_string is twice as slow as __memset_avx512_unaligned_erms? To check this, I adapted the initial Rust program to explicitly use REP MOVS:

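use std::arch::asm;

fn main() {
    let src = [0u8; 1 << 15];
    let mut dst = [0u8; 1 << 15];
    let mut copied = 0;
    while copied < (1000u64 << 30) {
        unsafe {
            // rep movsb copies RCX bytes from [RSI] to [RDI].
            asm!(
                "rep movsb",
                inout("rsi") src.as_ptr() => _,
                inout("rdi") dst.as_mut_ptr() => _,
                // Pass the count as a full 64-bit value, since rep movsb
                // reads the whole RCX register in 64-bit mode.
                inout("rcx") 1usize << 15 => _,
            );
        }
        copied += 1 << 15;
    }
}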

The throughput is 80 GB/s. This is a slowdown by a factor of 2, exactly what we observe with the kernel function!

Errata: I somehow did not test this particular program with 16 KiB buffers. Using two 32 KiB buffers saturates the L1 data cache. Using 16 KiB buffers increases the performance to 153 GB/s.

Now, we know that the Linux kernel is not using SIMD to copy memory and that this makes copy_user_enhanced_fast_string twice as slow as it could be.

But why is that? Over at Stack Overflow, Peter Cordes explains that using SSE/AVX instructions is not worth it in most cases, because of the cost of saving and restoring the SIMD context.

In summary: the kernel is spending quite a bit of time on managing memory, and it is not even using SIMD when actually copying the bytes. This is the source of the 10× slow-down we see when comparing with the ideal case.

Errata: the part about the memory management overhead is still true, but this has nothing to do with SIMD.

vmsplice to the Rescue

We now have an upper bound (167 GB/s to write the data in memory once) and a lower bound (17 GB/s when using write on a pipe). Let’s look in detail at the effect of using vmsplice. It mitigates the cost of using a pipe by moving entire buffers from user space to the kernel without copying them.

To understand how it works, again, read the excellent article by Francesco. We’ll be using the ./write program from that article to get a minimal example of using vmsplice. This program just writes an infinite number of 'X's. This will simplify the profiling by not having any time dedicated to compute Fizz Buzz data or something else.
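For readers who just want a feel for the call itself, here is a minimal sketch of handing a buffer to a pipe with vmsplice and reading it back. It is not the ./write program from that article: the struct and helper names are made up for this example, the libc functions are declared by hand following pipe(2) and vmsplice(2), and it assumes Linux and a Rust toolchain recent enough to accept unsafe extern blocks.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::os::fd::FromRawFd;

/// Same layout as C's `struct iovec`: a base pointer and a length.
#[repr(C)]
struct IoVec {
    base: *const u8,
    len: usize,
}

// Hand-written declarations of two libc functions; see pipe(2) and vmsplice(2).
unsafe extern "C" {
    fn pipe(fds: *mut i32) -> i32;
    fn vmsplice(fd: i32, iov: *const IoVec, nr_segs: usize, flags: u32) -> isize;
}

/// Maps the pages backing `buf` into the pipe `fd` instead of copying them.
/// The caller must not modify `buf` until the reader has consumed it.
fn vmsplice_buf(fd: i32, buf: &[u8]) -> io::Result<usize> {
    let iov = IoVec { base: buf.as_ptr(), len: buf.len() };
    let n = unsafe { vmsplice(fd, &iov, 1, 0) };
    if n < 0 { Err(io::Error::last_os_error()) } else { Ok(n as usize) }
}

/// Round trip through a fresh pipe: vmsplice `data` in, read it back out.
fn roundtrip(data: &[u8]) -> io::Result<Vec<u8>> {
    let mut fds = [0i32; 2];
    if unsafe { pipe(fds.as_mut_ptr()) } != 0 {
        return Err(io::Error::last_os_error());
    }
    // Wrapping the descriptors in File closes them when they go out of scope.
    let mut reader = unsafe { File::from_raw_fd(fds[0]) };
    let writer = unsafe { File::from_raw_fd(fds[1]) };
    let n = vmsplice_buf(fds[1], data)?;
    drop(writer);
    let mut out = vec![0u8; n];
    reader.read_exact(&mut out)?;
    Ok(out)
}

fn main() -> io::Result<()> {
    let page = vec![b'X'; 4096];
    let back = roundtrip(&page)?;
    println!("moved {} bytes through the pipe", back.len());
    Ok(())
}
```

The comment on vmsplice_buf is the important part: since the pipe refers to the caller's pages, overwriting the buffer too early changes what the reader sees.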

./write actually achieves 210 GB/s, well above our upper bound, but that’s because the program is kind of cheating by reusing the same buffers to pass to vmsplice. For anything other than a stream of constant bytes, we will actually have to fill the buffers with new data, which is where the upper bound actually applies. In any case, we only care about what vmsplice does:

[Flame graph: profiling of ./write, run as ./write --write_with_vmsplice --huge_page --busy_loop | ./read --read_with_splice --busy_loop. Widest frames: __do_sys_vmsplice (97.82%), __mutex_lock.constprop.0 (36.77%), mutex_spin_on_owner (20.70%), add_to_pipe (17.17%), iov_iter_get_pages2 (14.93%), import_iovec (6.47%).]

Like with write, we are spending a significant amount of time (37%) in __mutex_lock.constprop.0. However, there is no __alloc_pages and no _raw_spin_lock_irq. And, instead of copy_user_enhanced_fast_string, we find add_to_pipe, import_iovec and iov_iter_get_pages2. From this, we can see how vmsplice bypasses the expensive parts of the write system call.

As an aside, I was a bit surprised about the effect of the buffer size, especially when not using vmsplice. It looks like minimizing the number of system calls is not always the most important thing to do.

What      Buffer size   Throughput (GB/s)   System calls   Instructions   ins/syscall
./write         32768                  99        3276822     7373684904          2250
./write         65536                 150        1638466     5438514152          3319
./write        131072                 207         819270     4288897413          5235
zeroes          32768                  17        3276800    31859864089          9723
zeroes          65536                  13        1638400    31750857264         19379
zeroes         131072                  12         819200    35002733773         42728
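The last column of the table is just the ratio of the two columns before it. Dividing once more by the buffer size gives instructions per byte, which is a handy way to see whether fewer system calls actually reduce the work done. The sketch below recomputes both from the table. It assumes every system call moves exactly one full buffer, and the type and function names are made up for this example.

```rust
/// One row of the table above.
struct Row {
    what: &'static str,
    buf_size: u64,
    syscalls: u64,
    instructions: u64,
}

const ROWS: [Row; 6] = [
    Row { what: "./write", buf_size: 32768, syscalls: 3_276_822, instructions: 7_373_684_904 },
    Row { what: "./write", buf_size: 65536, syscalls: 1_638_466, instructions: 5_438_514_152 },
    Row { what: "./write", buf_size: 131072, syscalls: 819_270, instructions: 4_288_897_413 },
    Row { what: "zeroes", buf_size: 32768, syscalls: 3_276_800, instructions: 31_859_864_089 },
    Row { what: "zeroes", buf_size: 65536, syscalls: 1_638_400, instructions: 31_750_857_264 },
    Row { what: "zeroes", buf_size: 131072, syscalls: 819_200, instructions: 35_002_733_773 },
];

/// Instructions per system call, rounded to the nearest integer.
fn ins_per_syscall(r: &Row) -> u64 {
    (r.instructions as f64 / r.syscalls as f64).round() as u64
}

/// Instructions per byte, assuming each call moves one full buffer.
fn ins_per_byte(r: &Row) -> f64 {
    r.instructions as f64 / (r.syscalls * r.buf_size) as f64
}

fn main() {
    for r in &ROWS {
        println!(
            "{:8} {:>7} {:>6} ins/syscall {:.3} ins/byte",
            r.what,
            r.buf_size,
            ins_per_syscall(r),
            ins_per_byte(r)
        );
    }
}
```

If the per-byte figure stays roughly flat as the buffer grows, most of the cost scales with the amount of data rather than with the number of calls, and larger buffers will not buy much.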

Wrapping Up

There you have it. Writing to a pipe is ten times slower than writing to raw memory. And this is because, when writing to a pipe, we need to spend a lot of time taking a lock, and we cannot use vector instructions efficiently.

In principle, we could move data at 167 GB/s, but we need to avoid the cost of locking the buffer, and the cost of saving and restoring the SIMD context. This is exactly what splice and vmsplice do. They are often described as avoiding copying data between buffers, and this is true, but, most importantly, they completely bypass the conservative kernel code with extensive procedures and scalar code.

Errata: the conclusion about the kernel having overhead because of memory management is still true. However, not using vector instructions is not the penalty I thought it was. In addition, it turns out that processes cannot communicate through L1 cache. So, actually reading that data would incur a serious penalty and the 167 GB/s will not be reached in practice.

  1. Of course, they need to write code fast enough to exploit what vmsplice enables, but the point is that the first group’s performance is limited by not using vmsplice. ↩︎
  2. All benchmarks were performed on my personal desktop computer, which features a 7950X3D and DDR5 RAM overclocked to 6000 MT/s. I am running Debian 12 with a 6.1.0-18-amd64 Linux kernel. CPU mitigations were disabled using the Linux kernel option mitigations=off.
    As mentioned by ais523, it is important to pin the processes to specific cores. I have used logical cores 27 and 29, but I trim taskset -c 27 and taskset -c 29 from the commands in this article for the sake of readability. Look into /sys/devices/system/cpu/cpu*/acpi_cppc/highest_perf to know the relative performance of your cores. ↩︎
  3. See “L1 Cache write” in the last row of the second table from the bottom of the LanOC review. This gives 2,518.4 GB/s for all 16 physical cores, or 157.4 GB/s per physical core. ↩︎
  4. Note that the compiler was still smart enough to use memset instead of memcpy. Actually using memcpy as a naive compilation would do, while keeping an optimized build, is actually not trivial. ↩︎
  5. I had to install linux-image-6.1.0-18-amd64-dbg to get the file /usr/lib/debug/boot/System.map-6.1.0-18-amd64 with the symbols. ↩︎
  6. Someone at Hacker News has the answer! ↩︎