Google engineer Vinay Banakar sent out a patch this week for the Linux kernel’s memory management code that optimizes TLB flushes during page reclaim and is showing very promising results.
Not yet queued for the mainline kernel but simply volleyed onto the Linux kernel mailing list this week, the patch is the result of Banakar’s work on optimizing TLB flushing during page reclamation. Vinay explained on the LKML:
“The current implementation in shrink_folio_list() performs full TLB flushes and issues IPIs for each individual page being reclaimed. This causes unnecessary overhead during memory reclaim, whether triggered by madvise(MADV_PAGEOUT) or kswapd, especially in scenarios where applications are actively moving cold pages to swap while maintaining high performance requirements for hot pages.
The current code:
1. Clears PTE and unmaps each page individually
2. Performs a full TLB flush on all cores using the VMA (via CR3 write) or issues individual TLB shootdowns (invlpg+invpcid) for single-core usage
3. Submits each page individually to BIO

This approach results in:
– Excessive full TLB flushes across all cores
– Unnecessary IPI storms when processing multiple pages
– Suboptimal I/O submission patterns

I initially tried using selective TLB shootdowns (invlpg) instead of full TLB flushes for each page to avoid interference with other threads. However, this approach still required sending IPIs to all cores for each page, which did not significantly improve application throughput.
This patch instead optimizes the process by batching operations, issuing one IPI per PMD instead of per page. This reduces interrupts by a factor of 512 and enables batching page submissions to BIO. The new approach:
1. Collect dirty pages that need to be written back
2. Issue a single TLB flush for all dirty pages in the batch
3. Process the collected pages for writeback (submit to BIO)

Testing shows a significant reduction in application throughput impact during page-out operations. Applications maintain better performance during memory reclaim when triggered by explicit madvise(MADV_PAGEOUT) calls.”
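To make the before/after difference concrete, here is a minimal sketch in C of the two patterns the patch message describes. The helper names (unmap_page_pte, flush_tlb_and_ipi, flush_tlb_batch, submit_page_to_bio) are hypothetical stand-ins rather than real kernel functions, and this is not the actual patch code. The factor of 512 falls out of the page table geometry: a PMD covers 2 MiB, i.e. 512 pages of 4 KiB each.

```c
/*
 * Conceptual sketch only. All helper names below are hypothetical
 * stand-ins, not real kernel APIs, and this is not the patch itself.
 */
struct page;

void unmap_page_pte(struct page *page);            /* clear the PTE */
void flush_tlb_and_ipi(struct page *page);         /* full flush + IPI round */
void flush_tlb_batch(struct page **pages, int nr); /* one flush per batch */
void submit_page_to_bio(struct page *page);        /* queue writeback I/O */

/* Current behavior: a full TLB flush and IPI round for every page. */
void reclaim_per_page(struct page **pages, int nr)
{
	for (int i = 0; i < nr; i++) {
		unmap_page_pte(pages[i]);
		flush_tlb_and_ipi(pages[i]);
		submit_page_to_bio(pages[i]);
	}
}

/*
 * Patched behavior: unmap everything first, issue a single TLB flush
 * for the whole PMD-sized batch (512 pages), then submit the I/O
 * for the batch in one pass.
 */
void reclaim_batched(struct page **pages, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		unmap_page_pte(pages[i]);

	flush_tlb_batch(pages, nr);

	for (i = 0; i < nr; i++)
		submit_page_to_bio(pages[i]);
}
```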
In a follow-up LKML message replying to an inquiry about the performance benefits, Vinay added:
“Yes, we reduce IPIs by a factor of 512 by sending one IPI (for TLB flush) per PMD rather than per page. Since shrink_folio_list() operates on one PMD at a time, I believe we can safely batch these operations here.
Here’s a concrete example:
When swapping out 20 GiB (5.2M pages):
– Current: Each page triggers an IPI to all cores
– With 6 cores: 31.4M total interrupts (6 cores × 5.2M pages)
– With patch: One IPI per PMD (512 pages)
– Only 10.2K IPIs required (5.2M/512)
– With 6 cores: 61.4K total interrupts
– Results in ~99% reduction in total interrupts

Application performance impact varies by workload, but here’s a representative test case:
– Thread 1: Continuously accesses a 2 GiB private anonymous map (64B chunks at random offsets)
– Thread 2: Pinned to different core, uses MADV_PAGEOUT on 20 GiB private anonymous map to swap it out to SSD
– The threads only access their respective maps.

Results:
– Without patch: Thread 1 sees ~53% throughput reduction during swap. If there are multiple worker threads (like thread 1), the cumulative throughput degradation will be much higher
– With patch: Thread 1 maintains normal throughput

I expect a similar application performance impact when memory reclaim is triggered by kswapd.”
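For readers wanting to experiment with this reclaim path themselves, below is a small userspace sketch (not Vinay’s actual benchmark) of the page-out side of such a test: it maps a private anonymous region, dirties it, and asks the kernel to push it out to swap with madvise(MADV_PAGEOUT), which has been available since Linux 5.4. The 1 GiB size is an assumption to keep the example modest; the test described above used 20 GiB.

```c
/*
 * Hypothetical reproduction sketch, not the author's benchmark:
 * map a private anonymous region, dirty it, then ask the kernel to
 * page it out with MADV_PAGEOUT (Linux 5.4+). Each such call drives
 * the shrink_folio_list() path discussed above.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21	/* fallback for older libc headers */
#endif

int main(void)
{
	size_t len = 1UL << 30;	/* 1 GiB here; the test above used 20 GiB */
	char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(map, 0xaa, len);	/* fault in and dirty every page */

	/*
	 * Request reclaim of the whole range; the dirty anonymous pages
	 * get written to swap, which is where the per-page TLB flushes
	 * (or, with the patch, one flush per PMD batch) occur.
	 */
	if (madvise(map, len, MADV_PAGEOUT) != 0)
		perror("madvise(MADV_PAGEOUT)");

	munmap(map, len);
	return 0;
}
```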
A very nice showing for this patch, and hopefully it, or some form of it, will manage to make it into the mainline Linux kernel.