Linux Kernel Patches To Use AMD INVLPGB Instruction Show Huge Speed-Up

AMD processors since Zen 3 have offered the INVLPGB instruction for invalidating TLB entries for a range of pages with broadcast. As mentioned back during the AMD EPYC 7003 “Milan” launch, software use of this new instruction was limited… Over the past nearly four years INVLPGB use has remained limited, in part because Intel CPUs do not support it, but there is now a Linux kernel patch series making use of INVLPGB for some nice performance benefits.

INVLPGB was added with AMD Zen 3 processors and continues to be supported by newer Zen processors too. While there was GCC compiler support and some limited use of it by the KVM code, the Linux kernel hasn’t widely made use of INVLPGB… In part that is because Intel engineers typically carry out much of the optimization work for new x86 instructions within the Linux kernel, and Intel processors do not currently support INVLPGB.

Today though open-source developer Rik van Riel with Meta (Facebook) posted a set of 10 kernel patches to begin making use of the AMD broadcast TLB invalidation functionality. The patches allow the kernel to invalidate TLB entries on remote CPUs without needing to send IPIs and without having to wait for remote CPUs to handle those interrupts. Plus, using the INVLPGB instruction leads to less interruption for whatever workloads are running on the affected CPUs.
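To make the difference concrete, here is a toy Python model of the two approaches. It is only an illustration of the idea described above, not kernel code: the ToyCPU class and both function names are hypothetical, and real hardware behavior is far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCPU:
    """Hypothetical stand-in for a CPU: cached page numbers plus an interrupt counter."""
    tlb: set = field(default_factory=set)
    interrupts: int = 0


def ipi_shootdown(remote_cpus, pages):
    """IPI-style flush: every remote CPU is interrupted and drops the entries itself."""
    for cpu in remote_cpus:
        cpu.interrupts += 1        # whatever was running there gets interrupted
        cpu.tlb -= set(pages)      # the interrupt handler invalidates local entries
    return len(remote_cpus)        # interrupt acknowledgements the sender waits for


def broadcast_invalidate(remote_cpus, pages):
    """Broadcast-style flush: entries are dropped without interrupting any CPU."""
    for cpu in remote_cpus:
        cpu.tlb -= set(pages)      # modelled as done by hardware, no handler runs
    return 0                       # no interrupt acknowledgements to wait for
```

In this model both paths leave the remote TLBs in the same state; only the interrupt count on the remote CPUs and the number of acknowledgements the sender waits for differ.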

But most exciting for end-users are the straight-up benefits of the Linux kernel using the INVLPGB instruction on modern AMD CPUs:

“Combined with the removal of unnecessary lru_add_drain calls (see https://lkml.org/lkml/2024/12/19/1388) this results in a nice performance boost for the will-it-scale tlb_flush2_threads test on an AMD Milan system with 36 cores:

– vanilla kernel: 527k loops/second

– lru_add_drain removal: 731k loops/second

– only INVLPGB: 527k loops/second

– lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed while TLB invalidation went down from 40% of the total CPU time to only around 4% of CPU time, the contention simply moved to the LRU lock.

Fixing both at the same time about doubles the number of iterations per second from this case.”

Some pretty wild gains in this particular test case with the throughput more than doubled from these Linux kernel patches… More surprising is that it’s taken ~4 years for these Linux kernel patches to materialize with Zen 3 having first debuted in late 2020.
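For reference, the ratios implied by the quoted figures can be worked out with a few lines of Python. The numbers are taken directly from the quote above; the dictionary layout and the speedups helper are just illustrative choices.

```python
# Figures quoted from the patch cover letter, in thousands of loops/second.
results = {
    "vanilla kernel": 527,
    "lru_add_drain removal": 731,
    "only INVLPGB": 527,
    "lru_add_drain + INVLPGB": 1157,
}


def speedups(results, baseline="vanilla kernel"):
    """Return each configuration's throughput relative to the baseline."""
    base = results[baseline]
    return {name: value / base for name, value in results.items()}


for name, ratio in speedups(results).items():
    print(f"{name}: {ratio:.2f}x")
```

That works out to roughly 1.39x for the lru_add_drain removal alone, no change for INVLPGB alone, and roughly 2.20x for both together on this one benchmark.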

The patch series also enables the AMD Translation Cache Extensions (TCE), which can help reduce the TLB miss rate. On capable AMD Zen processors, broadcast TLB invalidation using INVLPGB is done for multi-threaded processes using three or more CPUs so as not to exhaust the PCID space. Local TLB flushes are still used for single-threaded processes.
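The selection rule described above can be sketched as a small decision function. This is a simplified Python model rather than the kernel's actual logic: the function and constant names are hypothetical, and the fallback to IPI-based flushing for processes that do not qualify for broadcast is an assumption based on the existing mechanism the patches are described as replacing.

```python
BROADCAST_MIN_CPUS = 3  # threshold stated in the article; the name is hypothetical


def choose_flush_method(active_cpus: int, has_invlpgb: bool) -> str:
    """Pick a TLB flush strategy for a process in this simplified model."""
    if active_cpus <= 1:
        # A process on a single CPU only needs a local flush.
        return "local"
    if has_invlpgb and active_cpus >= BROADCAST_MIN_CPUS:
        # Reserved for processes spread over three or more CPUs,
        # so that the limited PCID space is not exhausted.
        return "broadcast"
    # Assumption: everything else keeps using IPI-based remote flushes.
    return "ipi"
```

For example, choose_flush_method(8, True) selects the broadcast path, while the same process on a CPU without INVLPGB falls back to IPIs in this model.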

The patches are now undergoing review on the Linux kernel mailing list.
