Beware Rust Buffering

This article is about a small pitfall I stumbled upon in Rust. Since it took me far too long to figure out, I wrote this article as a reminder. There is also a post scriptum, also about buffers, but not about Rust.

While writing my article about Linux pipes, I needed a small program to write some data, any data, very, very fast. So, I wrote ./zeroes, which just writes… zeroes.

Note that ./zeroes | program is very different from </dev/zero program. The latter just opens the special file /dev/zero on file descriptor 0 (stdin). Reading from this file then amounts to nothing more than zeroing the program’s buffer, without any data being copied around.
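
To make the distinction concrete, here is a minimal sketch of a consumer (let’s call it ./reader, a made-up name) that just counts the bytes it gets from stdin. With </dev/zero ./reader each read() merely zero-fills the buffer, while ./zeroes | ./reader actually moves bytes through a pipe:

use std::io::Read;

fn main() {
    // Toy consumer: count the bytes arriving on stdin.
    let mut buf = vec![0u8; 1 << 17];
    let mut total: u64 = 0;
    let mut stdin = std::io::stdin().lock();
    while let Ok(n) = stdin.read(&mut buf) {
        if n == 0 || total >= (100 << 30) {
            break;
        }
        total += n as u64;
    }
    eprintln!("read {total} bytes");
}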

What I wanted to do is closer to </dev/zero cat | program. But I wanted to peel back the layers by reducing this to the smallest working example. My first try was the obvious:

use std::io::Write;
fn main() {
    let vec = vec![b'\0'; 1 << 17]; // 128 KiB of zero bytes
    let mut total_written = 0;
    let mut stdout = std::io::stdout().lock();
    while let Ok(n) = stdout.write(&vec) {
        total_written += n;
        // Stop after ~100 GiB written.
        if total_written >= (100 << 30) {
            break;
        }
    }
}

However, this only achieves 25 GB/s when writing to /dev/null, even though doing so should really do nothing. And barely 8 GB/s when writing to an actual pipe. The reason is that, in Rust’s standard library, stdout is line-buffered. This means that we are copying the contents of vec into another (user-land) buffer before actually calling write.

We can work around this by manually wrapping file descriptor 1 (stdout) in a File. See also the related discussion on the Rust project. This results in:

use std::io::Write;
use std::os::fd::FromRawFd;
fn main() {
    let vec = vec![b'\0'; 1 << 17];
    let mut total_written = 0;
    // Wrap file descriptor 1 (stdout) in a File, bypassing the standard
    // library's line-buffered stdout handle.
    let mut stdout = unsafe { std::fs::File::from_raw_fd(1) };
    while let Ok(n) = stdout.write(&vec) {
        total_written += n;
        if total_written >= (100 << 30) {
            break;
        }
    }
}

With this, writing to /dev/null becomes free (1,300 GB/s!¹), and we get to 11 GB/s when writing to a pipe.

Post Scriptum

Oh, by the way, yes achieves 14 GB/s. Yes, yes, the silly little program that just answers y to all the questions of an annoying program with many questions.
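
For intuition, here is what a toy yes in the same spirit could look like (a sketch only; GNU coreutils’ actual implementation differs): fill a buffer with "y\n" repeated, then write it in large chunks so each system call sends many answers at once.

use std::io::Write;
use std::os::fd::FromRawFd;

fn main() {
    // 32 KiB buffer filled with "y\n" repeated.
    let buf: Vec<u8> = b"y\n".iter().copied().cycle().take(1 << 15).collect();
    // Write it to file descriptor 1 (stdout) until the reader goes away.
    let mut stdout = unsafe { std::fs::File::from_raw_fd(1) };
    while stdout.write_all(&buf).is_ok() {}
}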

It turns out that the buffer size I used in ./zeroes (1 << 17 bytes) is not optimal. You can see below the effect of varying it.

Buffer size    Throughput
    1,024       5.4 GB/s
    2,048       7.8 GB/s
    4,096      12.5 GB/s
    8,192      14.1 GB/s
   16,384      15.5 GB/s
   32,768      16.0 GB/s
   65,536      11.7 GB/s
  131,072      11.1 GB/s
  262,144      10.7 GB/s
  524,288      10.7 GB/s
1,048,576      10.7 GB/s
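
Numbers like these can be reproduced with a sweep over buffer sizes built on the same write loop as above. Here is a sketch (my reconstruction, not necessarily the harness used for the table), printing the rates on stderr so they don’t get mixed with the zeroes on stdout:

use std::io::Write;
use std::os::fd::FromRawFd;
use std::time::Instant;

fn main() {
    // Take ownership of file descriptor 1 (stdout) once, outside the loop,
    // so it is not closed between measurements.
    let mut out = unsafe { std::fs::File::from_raw_fd(1) };
    for shift in 10..=20 {
        let buf = vec![b'\0'; 1usize << shift]; // 1 KiB up to 1 MiB
        let target = 10usize << 30; // write 10 GiB per buffer size (arbitrary)
        let mut total = 0;
        let start = Instant::now();
        while total < target {
            match out.write(&buf) {
                Ok(n) => total += n,
                Err(_) => return, // e.g. broken pipe
            }
        }
        let secs = start.elapsed().as_secs_f64();
        eprintln!("{:>9} B: {:.1} GB/s", 1usize << shift, total as f64 / secs / 1e9);
    }
}

Running it as ./sweep >/dev/null or ./sweep | cat >/dev/null (the binary name is made up) prints one throughput line per buffer size.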

Using a larger buffer improves performance, since it allows us to perform fewer system calls. However, performance falls off a cliff once we go above 32 KiB, because we exceed the capacity of the L1 data cache of the 7950X3D (the CPU I am testing on):

$ lscpu -C=NAME,ONE-SIZE   
NAME ONE-SIZE
L1d       32K
L1i       32K
L2         1M
L3        96M

But I copied that value (1 << 17) from what pv was doing. Why does it choose this buffer size? This is set at this line:

sz = (size_t) (sb.st_blksize * 32);

st_blksize is the “preferred block size for efficient filesystem I/O”. We can get it with the stat command:

$ stat -c %o /dev/zero
4096
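
The same value can also be queried from Rust; here is a small sketch, assuming a Unix target, using std::os::unix::fs::MetadataExt:

use std::os::unix::fs::MetadataExt;

fn main() -> std::io::Result<()> {
    // st_blksize of /dev/zero, the "preferred block size for efficient
    // filesystem I/O", and the 32x value pv derives from it.
    let blksize = std::fs::metadata("/dev/zero")?.blksize();
    println!("st_blksize = {blksize}, times 32 = {}", blksize * 32);
    Ok(())
}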

Looking at the throughput table above, 4,096 bytes is actually suboptimal for transfers. pv counteracts this by increasing the value 32-fold (4,096 × 32 = 131,072 bytes, i.e. the 1 << 17 I copied). In practice, you won’t get inter-process communication at the speed of the L1 data cache anyway, so this is a good heuristic.

  1. In practice, you’re just measuring how fast you can evaluate total_written += n. ↩︎