Sent out today and already merged for the in-development Linux 6.18 kernel is the latest batch of 64-bit ARM “ARM64” architecture fixes. Most notable is a fix addressing a “catastrophic performance issue” that was uncovered.
Catching my attention with today’s ARM64 pull request was this paragraph:
“The other interesting fix is addressing a catastrophic performance issue with our per-cpu atomics discovered by Paul in the SRCU locking code but which took some interactions with the hardware folks to resolve.”
Catastrophic performance issue? Digging through the patch and mailing list discussion, it all started off in this thread from Linux developer Paul McKenney:
“To make event tracing safe for PREEMPT_RT kernels, I have been creating optimized variants of SRCU readers that use per-CPU atomics. This works quite well, but on ARM Neoverse V2, I am seeing about 100ns for a srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single per-CPU atomic operation. This contrasts with a handful of nanoseconds on x86 and similar on ARM for a atomic_set(&foo, atomic_read(&foo) + 1).”
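To give a concrete sense of what is being timed there, here is a minimal, hypothetical kernel-module sketch (not McKenney's actual test code) that measures a batch of srcu_read_lock()/srcu_read_unlock() pairs and a batch of this_cpu_inc() per-CPU increments:

```c
/*
 * Hypothetical microbenchmark sketch -- not McKenney's actual test code.
 * It times one million srcu_read_lock()/srcu_read_unlock() pairs and one
 * million this_cpu_inc() updates; CPU migration between iterations is
 * ignored since only a rough per-operation cost is of interest.
 */
#include <linux/module.h>
#include <linux/srcu.h>
#include <linux/percpu.h>
#include <linux/ktime.h>

DEFINE_STATIC_SRCU(demo_srcu);
static DEFINE_PER_CPU(long, demo_counter);

static int __init demo_init(void)
{
	u64 start, delta;
	int i, idx;

	/* Time SRCU reader entry/exit pairs. */
	start = ktime_get_ns();
	for (i = 0; i < 1000000; i++) {
		idx = srcu_read_lock(&demo_srcu);
		srcu_read_unlock(&demo_srcu, idx);
	}
	delta = ktime_get_ns() - start;
	pr_info("srcu lock/unlock pair: ~%llu ns each\n", delta / 1000000);

	/* Time the per-CPU atomic increment the report is about. */
	start = ktime_get_ns();
	for (i = 0; i < 1000000; i++)
		this_cpu_inc(demo_counter);
	delta = ktime_get_ns() - start;
	pr_info("this_cpu_inc: ~%llu ns each\n", delta / 1000000);

	return 0;
}

static void __exit demo_exit(void)
{
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```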
It turns out per-CPU atomics on ARM64 were extremely expensive. From the ensuing discussion it emerged that disabling the Linux kernel’s support for the Large System Extensions (LSE) atomic instructions introduced in ARMv8.1 made the performance not quite so bad.
Linux developer Willy Tarreau also chimed in with his own findings of costly atomics on 64-bit ARM hardware:
“This is super interesting! I’ve blindly applied a similar change to all of our atomics in haproxy and am seeing a consistent 2-7% perf increase depending on the tests on a 80-core Ampere Altra (neoverse-n1). There as well we’re significantly using atomics to read/update mostly local variables as we avoid sharing as much as possible. I’m pretty sure it does hurt in certain cases, and we don’t have this distinction of per_cpu variants like here, however that makes me think about adding a “mostly local” variant that we can choose from depending on the context.”
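As a rough illustration of that “mostly local” idea, a userspace sketch (illustrative only, not haproxy’s actual code) might keep one padded counter slot per thread and only sum the slots when an aggregate value is needed:

```c
/*
 * Illustrative "mostly local" counter sketch, not haproxy code: each
 * thread updates its own cache-line-padded slot on the hot path, so
 * there is no sharing, and a reader sums the slots on demand.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    1000000

/* One padded slot per thread to avoid false sharing. */
struct counter_slot {
	_Atomic long value;
	char pad[64 - sizeof(_Atomic long)];
};

static struct counter_slot local_counter[NTHREADS];

static void *worker(void *arg)
{
	long id = (long)arg;
	int i;

	/* Relaxed ordering: this slot is only ever written by this thread. */
	for (i = 0; i < ITERS; i++)
		atomic_fetch_add_explicit(&local_counter[id].value, 1,
					  memory_order_relaxed);
	return NULL;
}

/* Aggregate view, computed only when someone actually needs the total. */
static long counter_total(void)
{
	long sum = 0;
	int i;

	for (i = 0; i < NTHREADS; i++)
		sum += atomic_load_explicit(&local_counter[i].value,
					    memory_order_relaxed);
	return sum;
}

int main(void)
{
	pthread_t tids[NTHREADS];
	long i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, worker, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);

	printf("total = %ld\n", counter_total());
	return 0;
}
```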
Long story short, to address the high latency this patch is now merged for ARM64 to use LSE load atomics for the non-return per-CPU atomic operations:
“The non-return per-CPU this_cpu_*() atomic operations are implemented as STADD/STCLR/STSET when FEAT_LSE is available. On many microarchitecture implementations, these instructions tend to be executed “far” in the interconnect or memory subsystem (unless the data is already in the L1 cache). This is in general more efficient when there is contention as it avoids bouncing cache lines between CPUs. The load atomics (e.g. LDADD without XZR as destination), OTOH, tend to be executed “near” with the data loaded into the L1 cache.
STADD executed back to back as in srcu_read_{lock,unlock}*() incur an additional overhead due to the default posting behaviour on several CPU implementations. Since the per-CPU atomics are unlikely to be used concurrently on the same memory location, encourage the hardware to execute them “near” by issuing load atomics – LDADD/LDCLR/LDSET – with the destination register unused (but not XZR).”
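For a rough feel of the two instruction forms being discussed, here is a small standalone sketch (not the kernel’s actual this_cpu_*() implementation) using GCC inline assembly, built with -march=armv8.1-a so the LSE instructions are available:

```c
/*
 * Standalone sketch of the "far" vs "near" atomic add forms described in
 * the patch. Build with: gcc -O2 -march=armv8.1-a stadd_vs_ldadd.c
 * Not the kernel's actual per-CPU atomic implementation.
 */
#include <stdio.h>

/*
 * "Far" form: STADD has no destination register, so many cores post the
 * operation out toward the interconnect/memory subsystem.
 */
static inline void add_stadd(long *p, long v)
{
	asm volatile("stadd %[v], %[p]"
		     : [p] "+Q" (*p)
		     : [v] "r" (v)
		     : "memory");
}

/*
 * "Near" form after the fix: LDADD with an ordinary (unused) destination
 * register, which encourages the CPU to pull the line into its L1 cache.
 */
static inline void add_ldadd(long *p, long v)
{
	long unused;

	asm volatile("ldadd %[v], %[unused], %[p]"
		     : [p] "+Q" (*p), [unused] "=r" (unused)
		     : [v] "r" (v)
		     : "memory");
}

int main(void)
{
	long counter = 0;

	add_stadd(&counter, 1);
	add_ldadd(&counter, 1);
	printf("counter = %ld\n", counter);
	return 0;
}
```

The only difference between the two helpers is whether the result is discarded at the instruction level (STADD) or written to an ordinary, unused register (LDADD), which is the nudge the merged patch applies to the non-return this_cpu_*() operations.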
With that, the high latency originally reported is now much more reasonable for this otherwise “catastrophic performance issue”.
The ARM64 fixes were merged today ahead of the Linux 6.18-rc6 kernel release on Sunday and then the Linux 6.18 stable debut around the end of November.
