This article is about a small pitfall I stumbled upon in Rust. Since it took me more time than it should have to figure out, I wrote this article as a reminder. There is also a post scriptum, also about buffers, but not about Rust.
While writing my article about Linux pipes, I needed a small program to write some data, any data, very, very fast. So, I wrote `./zeroes`, which just writes… zeroes.
Note that `./zeroes | program` is very different from `</dev/zero program`. The latter just opens the special file `/dev/zero` on file descriptor 0 (`stdin`). And reading from this file is nothing more than setting the program’s buffer to zero, without any data being copied around.
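To see what that means in practice, here is a minimal sketch of a program reading from `/dev/zero`: every `read` simply zero-fills the buffer we hand to the kernel.

```rust
use std::fs::File;
use std::io::Read;

fn main() -> std::io::Result<()> {
    // Reading from /dev/zero copies no data from anywhere:
    // the kernel just zero-fills whatever buffer we hand it.
    let mut zero = File::open("/dev/zero")?;
    let mut buf = [0xFFu8; 16];
    let n = zero.read(&mut buf)?;
    assert_eq!(n, buf.len());
    assert!(buf.iter().all(|&b| b == 0));
    Ok(())
}
```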
What I wanted to do is closer to `</dev/zero cat | program`. But I wanted to peel back the layers by reducing this to the smallest working example. My first try was the obvious:
```rust
use std::io::Write;

fn main() {
    let vec = vec![b'\0'; 1 << 17];
    let mut total_written = 0;
    let mut stdout = std::io::stdout().lock();
    while let Ok(n) = stdout.write(&vec) {
        total_written += n;
        if total_written >= (100 << 30) {
            break;
        }
    }
}
```
However, this only achieves 25 GB/s when writing to `/dev/null`, even though doing so should really do nothing. And barely 8 GB/s when writing to an actual pipe. The reason is that, in the Rust standard library, `stdout` is actually line-buffered. This means that we are copying the contents of `vec` to another (user-land) buffer before actually calling `write`.
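This is easy to observe in isolation: a write that contains no newline (and is smaller than the internal buffer) just sits in the `LineWriter` buffer until it is flushed. A minimal sketch:

```rust
use std::io::Write;

fn main() {
    let mut out = std::io::stdout().lock();
    // No newline in the data: the locked handle (a LineWriter internally)
    // copies it into its own buffer instead of passing it on to fd 1.
    out.write_all(b"buffered until flush").unwrap();
    // Only now is the data actually written to file descriptor 1.
    out.flush().unwrap();
}
```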
We can work around this by manually creating a `File` for file descriptor 1 (`stdout`). See also the related discussion on the Rust project. This results in:
```rust
use std::io::Write;
use std::os::fd::FromRawFd;

fn main() {
    let vec = vec![b'\0'; 1 << 17];
    let mut total_written = 0;
    let mut stdout = unsafe { std::fs::File::from_raw_fd(1) };
    while let Ok(n) = stdout.write(&vec) {
        total_written += n;
        if total_written >= (100 << 30) {
            break;
        }
    }
}
```
With this, writing to `/dev/null` becomes free (1,300 GB/s!¹), and we get to 11 GB/s when writing to a pipe.
Post Scriptum
Oh, by the way, `yes` achieves 14 GB/s. Yes, `yes`, the small silly program that just answers `y` to all questions in an annoying program with many questions.
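For reference, a `yes`-like program can be sketched with the same raw-fd trick as above (this is only an illustration of the idea, not how GNU `yes` is actually written): fill a large buffer with `y\n` lines and write it in big chunks.

```rust
use std::io::Write;
use std::os::fd::FromRawFd;

fn main() {
    // Pre-fill a large buffer with "y\n" repeated, so that every write()
    // syscall pushes many lines at once.
    let mut buf = Vec::with_capacity(1 << 16);
    while buf.len() + 2 <= buf.capacity() {
        buf.extend_from_slice(b"y\n");
    }
    // SAFETY: file descriptor 1 (stdout) stays open for the whole program.
    let mut stdout = unsafe { std::fs::File::from_raw_fd(1) };
    // Stop once the reader goes away and write starts failing (e.g. EPIPE).
    while stdout.write_all(&buf).is_ok() {}
}
```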
It turns out that the size I used for the buffer is not optimal. You can see below the effect of varying the size of the buffer.
| Buffer size (bytes) | Throughput |
|---|---|
| 1,024 | 5.4 GB/s |
| 2,048 | 7.8 GB/s |
| 4,096 | 12.5 GB/s |
| 8,192 | 14.1 GB/s |
| 16,384 | 15.5 GB/s |
| 32,768 | 16.0 GB/s |
| 65,536 | 11.7 GB/s |
| 131,072 | 11.1 GB/s |
| 262,144 | 10.7 GB/s |
| 524,288 | 10.7 GB/s |
| 1,048,576 | 10.7 GB/s |
Using a larger buffer improves performance, since it allows us to perform fewer system calls. However, performance falls off a cliff once we go above 32 kB, because we exceed the capacity of the L1 data cache of the 7950X3D (the CPU I am testing on):
```console
$ lscpu -C=NAME,ONE-SIZE
NAME ONE-SIZE
L1d 32K
L1i 32K
L2 1M
L3 96M
```
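To reproduce a table like the one above, something along these lines works as a rough benchmark (just a sketch, not the exact harness behind the numbers): write a fixed amount of data to stdout for each buffer size, report the rate on stderr, and pipe the program into a consumer such as `cat >/dev/null`.

```rust
use std::io::Write;
use std::os::fd::FromRawFd;
use std::time::Instant;

fn main() {
    // SAFETY: file descriptor 1 (stdout) stays open for the whole program.
    let mut stdout = unsafe { std::fs::File::from_raw_fd(1) };
    let target: u64 = 10 << 30; // ~10 GiB per buffer size
    for shift in 10..=20 {
        let buf = vec![b'\0'; 1usize << shift];
        let mut written: u64 = 0;
        let start = Instant::now();
        while written < target {
            match stdout.write(&buf) {
                Ok(n) => written += n as u64,
                Err(_) => return, // the reader went away
            }
        }
        let secs = start.elapsed().as_secs_f64();
        eprintln!(
            "{:>9} bytes: {:.1} GB/s",
            buf.len(),
            written as f64 / secs / 1e9
        );
    }
}
```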
But I copied that value (`1 << 17`) from what `pv` was doing. Why does it choose this buffer size? This is set at this line:

```c
sz = (size_t) (sb.st_blksize * 32);
```
`st_blksize` is the “preferred block size for efficient filesystem I/O”. We can get it with the `stat` command:
```console
$ stat -c %o /dev/zero
4096
```
Looking at the throughput table from above, 4,096 bytes is actually suboptimal for transfer. `pv` chooses to increase that value 32-fold to counteract this. In practice, you won’t get any inter-process communication at the speed of the L1 data cache, so this is a good heuristic.
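The same computation is straightforward from Rust; here is a small sketch (mirroring `pv`’s formula, not `pv`’s actual code) that reads `st_blksize` through the standard library and multiplies it by 32.

```rust
use std::os::unix::fs::MetadataExt;

fn main() -> std::io::Result<()> {
    // st_blksize: "preferred block size for efficient filesystem I/O",
    // the same value that `stat -c %o` prints.
    let blksize = std::fs::metadata("/dev/zero")?.blksize();
    // pv multiplies it by 32: 4,096 * 32 = 131,072 bytes (1 << 17).
    let buffer_size = blksize * 32;
    println!("st_blksize = {blksize}, buffer size = {buffer_size}");
    Ok(())
}
```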
1. In practice, you’re just measuring how fast you can evaluate `total_written += n`. ↩︎