Quentin Santos

Obsessed with computers since 2002

  • Beware Rust Buffering

    This article is about a small pitfall I stumbled upon in Rust. Since it took me longer than it should have to figure it out, I wrote this article as a reminder. There is also a post scriptum, also about buffers, but not about Rust.

    While writing my article about Linux pipes, I needed a small program to write some data, any data, very, very fast. So, I wrote ./zeroes, which just writes… zeroes.

    Note that ./zeroes | program is very different from </dev/zero program. The latter just opens the special file /dev/zero as file descriptor 0 (stdin), and reading from this file is nothing more than setting the program’s buffer to zero, without any data being copied around.
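
    For concreteness, here is a minimal sketch of what the program on the receiving end could look like (my own illustration, not something taken from the pipes article): it drains its standard input into a fixed buffer and stops after 100 GiB, so that it also terminates when fed from </dev/zero.

    use std::io::Read;
    fn main() {
        // Fixed 128 KiB buffer that every read() fills (or zeroes, for /dev/zero).
        let mut buf = vec![0u8; 1 << 17];
        let mut total_read: u64 = 0;
        let mut stdin = std::io::stdin().lock();
        while let Ok(n) = stdin.read(&mut buf) {
            if n == 0 {
                break; // end of input: the writer closed the pipe
            }
            total_read += n as u64;
            if total_read >= (100 << 30) {
                break; // stop after 100 GiB
            }
        }
    }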

    What I wanted to do is closer to </dev/zero cat | program. But I wanted to peel back the layers by reducing this to the smallest working example. My first try was the obvious:

    use std::io::Write;
    fn main() {
        let vec = vec![b'\0'; 1 << 17];
        let mut total_written = 0;
        let mut stdout = std::io::stdout().lock();
        while let Ok(n) = stdout.write(&vec) {
            total_written += n;
            if total_written >= (100 << 30) {
                break;
            }
        }
    }

    However, this only achieves 25 GB/s when writing to /dev/null, even though doing so should really do nothing, and barely 8 GB/s when writing to an actual pipe. The reason is that, in Rust’s standard library, stdout is actually line-buffered. This means that the contents of vec are copied to another (user-land) buffer before write is actually called.
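
    Concretely, stdout() behaves roughly like a LineWriter wrapped around the raw stream: every write is inspected for newlines and may be staged in an intermediate buffer first. The sketch below rebuilds that layering by hand around /dev/null; it is a simplification of what the standard library does (the real implementation also handles locking and re-entrancy).

    use std::io::{LineWriter, Write};
    fn main() -> std::io::Result<()> {
        let raw = std::fs::File::options().write(true).open("/dev/null")?;
        // Roughly the layer that std::io::stdout() adds on top of the raw
        // file descriptor: a line-buffered writer.
        let mut line_buffered = LineWriter::new(raw);
        let vec = vec![b'\0'; 1 << 17];
        // Each write goes through the LineWriter before reaching the file.
        line_buffered.write_all(&vec)?;
        line_buffered.flush()?;
        Ok(())
    }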

    We can work around this by manually creating a file for file descriptor 1 (stdout). See also the related discussion on the Rust project. This results in:

    use std::io::Write;
    use std::os::fd::FromRawFd;
    fn main() {
        let vec = vec![b'\0'; 1 << 17];
        let mut total_written = 0;
        let mut stdout = unsafe { std::fs::File::from_raw_fd(1) };
        while let Ok(n) = stdout.write(&vec) {
            total_written += n;
            if total_written >= (100 << 30) {
                break;
            }
        }
    }

    With this, writing to /dev/null becomes free (1,300 GB/s!¹), and we get to 11 GB/s when writing to a pipe.

    Post Scriptum

    Oh, by the way, yes achieves 14 GB/s. Yes, yes, that yes: the small, silly program that just answers y to all the questions of an annoying program with many questions.
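
    The trick behind that speed is, again, buffering: instead of writing “y\n” two bytes at a time, GNU yes fills a large buffer with “y\n” repeated and writes that buffer in a loop. A rough sketch of the idea in Rust (not the actual GNU implementation):

    use std::io::Write;
    use std::os::fd::FromRawFd;
    fn main() {
        // Fill a 16 KiB buffer with "y\n" repeated, then write it forever.
        let buf: Vec<u8> = b"y\n".repeat(1 << 13);
        let mut stdout = unsafe { std::fs::File::from_raw_fd(1) };
        // The loop ends when a write fails, e.g. when the reader closes the pipe.
        while stdout.write_all(&buf).is_ok() {}
    }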

    It turns out that the size I used for the buffer is not optimal. You can see below the effect of varying the size of the buffer.

    Buffer size (bytes)    Throughput
    1,024                  5.4 GB/s
    2,048                  7.8 GB/s
    4,096                  12.5 GB/s
    8,192                  14.1 GB/s
    16,384                 15.5 GB/s
    32,768                 16.0 GB/s
    65,536                 11.7 GB/s
    131,072                11.1 GB/s
    262,144                10.7 GB/s
    524,288                10.7 GB/s
    1,048,576              10.7 GB/s
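
    A parameterized variant of the program above, which takes the buffer size as a command-line argument, is enough to reproduce the shape of this curve (my own sketch, not necessarily the exact harness behind the table):

    use std::io::Write;
    use std::os::fd::FromRawFd;
    fn main() {
        // Buffer size in bytes, passed as the first argument (default: 128 KiB).
        let size: usize = std::env::args()
            .nth(1)
            .and_then(|s| s.parse().ok())
            .unwrap_or(1 << 17);
        let vec = vec![b'\0'; size];
        let mut total_written = 0;
        let mut stdout = unsafe { std::fs::File::from_raw_fd(1) };
        while let Ok(n) = stdout.write(&vec) {
            total_written += n;
            if total_written >= (100 << 30) {
                break; // stop after 100 GiB
            }
        }
    }

    Piping its output into pv > /dev/null, or into a reader like the one sketched earlier, gives one measurement per buffer size.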

    Using a larger buffer improves the performance, since it allows us to perform fewer system calls. However, the performance falls off a cliff once we go above 32 kB, because we exceed the capacity of the L1 data cache of the 7950X3D (the CPU I am testing on):

    $ lscpu -C=NAME,ONE-SIZE   
    NAME ONE-SIZE
    L1d       32K
    L1i       32K
    L2         1M
    L3        96M

    But I copied that value (1 << 17) from what pv was doing. Why does it choose this buffer size? This is set at this line:

    sz = (size_t) (sb.st_blksize * 32);

    st_blksize is the “preferred block size for efficient filesystem I/O”. We can get it with the stat command:

    $ stat -c %o /dev/zero
    4096
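
    The same value is available programmatically. In Rust, for instance, it can be read from the file’s metadata through the unix MetadataExt extension trait (a small sketch of mine, not something pv does):

    use std::os::unix::fs::MetadataExt;
    fn main() -> std::io::Result<()> {
        let metadata = std::fs::metadata("/dev/zero")?;
        // st_blksize, the "preferred block size for efficient filesystem I/O"
        let block_size = metadata.blksize();
        // pv multiplies this value by 32, as in the line quoted above
        println!("{} -> {}", block_size, block_size * 32);
        Ok(())
    }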

    Looking at the throughput table above, 4,096 bytes is actually suboptimal for transfers. pv chooses to increase that value 32-fold to counteract this. In practice, you won’t get any inter-process communication at the speed of the L1 data cache anyway, so this is a good heuristic.

    1. In practice, you’re just measuring how fast you can evaluate total_written += n.