Linux I/O expert and block/io_uring maintainer Jens Axboe of Meta has recently revisited his patches around uncached buffered I/O. Axboe started the “RWF_UNCACHED” effort back in 2019 to address the cliff in buffered I/O throughput that hits once the page cache fills up. That work faded away, but Axboe has now crafted a fresh set of patches implementing uncached buffered I/O, and they are showing extremely promising results.
Jens Axboe posted today to Twitter/X about his 2024 work on uncached buffered I/O for Linux:
“Uncached buffered IO is back, after a 5 year hiatus. Simpler and cleaner now. Up to 65-75% improvement, at half the CPU usage on my system. And none of the nonsense of the unpredictability of the page cache.”
These new patches currently reside within his buffered-uncached.2 Git branch. In the patch adding the RWF_UNCACHED flag, he explains:
“Add RWF_UNCACHED as a read operation flag, which means that any data read will be removed from the page cache upon completion. Uses the page cache to synchronize, and simply prunes folios that were instantiated when the operation completes.
…
You can think of uncached buffered IO as being the much more attractive cousin of O_DIRECT – it has none of the restrictions of O_DIRECT. Yes, it will copy the data, but unlike regular buffered IO, it doesn’t run into the unpredictability of the page cache in terms of reclaim. As an example, on a test box with 32 drives, reading them with buffered IO looks as follows:

Reading bs 65536, uncached 0
1s: 145945MB/sec
2s: 158067MB/sec
3s: 157007MB/sec
4s: 148622MB/sec
5s: 118824MB/sec
6s: 70494MB/sec
7s: 41754MB/sec
8s: 90811MB/sec
9s: 92204MB/sec
10s: 95178MB/sec
11s: 95488MB/sec
12s: 95552MB/sec
13s: 96275MB/sec

where it’s quite easy to see where the page cache filled up, and performance went from good to erratic, and finally settles at a much lower rate.
…
If the same test case is run with RWF_UNCACHED set for the buffered read, the output looks as follows:

Reading bs 65536, uncached 1
1s: 153144MB/sec
2s: 156760MB/sec
3s: 158110MB/sec
4s: 158009MB/sec
5s: 158043MB/sec
6s: 157638MB/sec
7s: 157999MB/sec
8s: 158024MB/sec
9s: 157764MB/sec
10s: 157477MB/sec
11s: 157417MB/sec
12s: 157455MB/sec
13s: 157233MB/sec
14s: 156692MB/sec

which is just chugging along at ~155GB/sec of read performance.
…
where just the test app is using CPU, no reclaim is taking place outside of the main thread. Not only is performance 65% better, it’s also using half the CPU to do it.”
Now that is a beautiful win.
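For those curious what this would look like from userspace, here is a minimal, hypothetical sketch of issuing an uncached buffered read with preadv2(). RWF_UNCACHED is not in mainline kernel headers yet, so the flag value below is only a placeholder assumption; the real definition would come from headers carrying Axboe’s patches.

/* Hedged sketch: an uncached buffered read via preadv2().
 * RWF_UNCACHED is a placeholder here, not an authoritative value. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED 0x40	/* placeholder value, not from mainline headers */
#endif

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* 64KiB buffer, matching the block size used in the test output above */
	static char buf[65536];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

	/* A normal buffered read, except the folios instantiated for it are
	 * pruned from the page cache once the read completes. */
	ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
	if (ret < 0)
		perror("preadv2(RWF_UNCACHED)");
	else
		printf("read %zd bytes uncached\n", ret);

	close(fd);
	return 0;
}

On a kernel without the patches, the call would be expected to fail with an error for the unsupported flag, so an application could fall back to a plain buffered read.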
Another patch adds support for RWF_UNCACHED for buffered writes:
“If RWF_UNCACHED is set for a write, mark the folios being written with drop_writeback. Then writeback completion will drop the pages. The write_iter handler simply kicks off writeback for the pages, and writeback completion will take care of the rest…the behavior is fully predictable, performing the same throughout even after the page cache would otherwise have fully filled with dirty data. It’s also about 75% faster, and using half the CPU of the system compared to the normal buffered write.”
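A similarly hedged sketch of the write side, assuming pwritev2() accepts the same flag once the patches land: per the patch description above, the written folios get queued for writeback and then dropped from the page cache when writeback completes.

/* Hedged sketch: an uncached buffered write via pwritev2(), under the same
 * assumption about where RWF_UNCACHED comes from. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED 0x40	/* placeholder value, not from mainline headers */
#endif

int main(void)
{
	/* "testfile" is just an example target path */
	int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	static char buf[65536];
	memset(buf, 'a', sizeof(buf));
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

	/* A normal buffered write, except writeback is kicked off immediately
	 * and the pages are dropped from the page cache once it completes. */
	ssize_t ret = pwritev2(fd, &iov, 1, 0, RWF_UNCACHED);
	if (ret < 0)
		perror("pwritev2(RWF_UNCACHED)");

	close(fd);
	return 0;
}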
That’s some really great work that will hopefully make it into the mainline Linux kernel, given these very exciting results.