IO_uring continues maturing while being one of the greatest innovations within the Linux kernel in the past number of years. With Linux 6.15, IO_uring is getting even more interesting with introducing network zero-copy receive support. With this new code a 200G link could be saturated off a single CPU core in a recent demonstration.
Jens Axboe of Meta does a great job describing the exciting IO_uring network zero-copy receive support within this week’s pull request:
“This pull request adds support for zero-copy receive with io_uring,enabling fast bulk receive of data directly into application memory, rather than needing to copy the data out of kernel memory. While this version only supports host memory as that was the initial target, other memory types are planned as well, with notably GPU memory coming next.
This work depends on some networking components which were queued up on the networking side, but have now landed in your tree.
This is the work of Pavel Begunkov and David Wei. From the v14 posting:
“We configure a page pool that a driver uses to fill a hw rx queue to hand out user pages instead of kernel pages. Any data that ends up hitting this hw rx queue will thus be dma’d into userspace memory directly, without needing to be bounced through kernel memory. ‘Reading’ data out of a socket instead becomes a _notification_ mechanism, where the kernel tells userspace where the data is. The overall approach is similar to the devmem TCP proposal.
This relies on hw header/data split, flow steering and RSS to ensure packet headers remain in kernel memory and only desired flows hit a hw rx queue configured for zero copy. Configuring this is outside of the scope of this patchset.
We share netdev core infra with devmem TCP. The main difference is that io_uring is used for the uAPI and the lifetime of all objects are bound to an io_uring instance. Data is ‘read’ using a new io_uring request type. When done, data is returned via a new shared refill queue. A zero copy page pool refills a hw rx queue from this refill queue directly. Of course, the lifetime of these data buffers are managed by io_uring rather than the networking stack, with different refcounting rules.
This patchset is the first step adding basic zero copy support. We will extend this iteratively with new features e.g. dynamically allocated zero copy areas, THP support, dmabuf support, improved copy fallback, general optimisations and more.”
In a local setup, I was able to saturate a 200G link with a single CPU core, and at netdev conf 0x19 earlier this month, Jamal reported 188Gbit of bandwidth using a single core (no HT, including soft-irq). Safe to say the efficiency is there, as bigger links would be needed to find the per-core limit, and it’s considerably more efficient and faster than the existing devmem solution.”
The pull request along with associated changes have been merged for Linux 6.15!
There is also another pull request that was merged for introducing IO_uring epoll reaping:
“This adds support for reading epoll events via io_uring. While this may seem counter-intuitive (and/or productive), the reasoning here is that quite a few existing epoll event loops can easily do a partial conversion to a completion based model, but are still stuck with one (or few) event types that remain readiness based. For that case, they then need to add the io_uring fd to the epoll context, and continue to rely on epoll_wait(2) for waiting on events. This misses out on the finer grained waiting that io_uring can do, to reduce context switches and wait for multiple events in one batch reliably.
With adding support for reaping epoll events via io_uring, the whole legacy readiness based event types can still be reaped via epoll, with the overall waiting in the loop be driven by io_uring.”
Plus some other IO_uring changes in a third pull request with vectored fixed/registered buffers and other small improvements.