Merged today ahead of the Linux 6.18-rc5 kernel due out on Sunday is a partial fix for a performance regression observed on IBM POWER hardware.
Since the “IMMUTABLE” flag was dropped from the kernel’s FUTEX code for the Linux 6.17 cycle, IBM engineers have noted a performance regression primarily affecting their hardware. Now for Linux 6.18-rc5 that performance regression is at least cut in half.
Intel engineer Peter Zijlstra worked out the partial fix/workaround by optimizing the per-CPU reference counting in the futex code. Zijlstra explained with the now-merged patch:
“Shrikanth noted that the per-cpu reference counter was still some 10% slower than the old immutable option (which removes the reference counting entirely).
Further optimize the per-cpu reference counter by:
– switching from RCU to preempt;
– using __this_cpu_*() since we now have preempt disabled;
– switching from smp_load_acquire() to READ_ONCE().

This is all safe because disabling preemption inhibits the RCU grace period exactly like rcu_read_lock().
Having preemption disabled allows using __this_cpu_*() provided the only access to the variable is in task context — which is the case here.
Furthermore, since we know changing fph->state to FR_ATOMIC demands a full RCU grace period we can rely on the implied smp_mb() from that to replace the acquire barrier().
This is very similar to the percpu_down_read_internal() fast-path.
The reason this is significant for PowerPC is that it uses the generic this_cpu_*() implementation which relies on local_irq_disable() (the x86 implementation relies on it being a single memop instruction to be IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE barrier, so not having to use explicit barriers saves a bunch.
Combined this reduces the performance gap by half, down to some 5%.”
This improvement was merged to the Linux 6.18 Git code today as the sole change of this week’s locking/urgent pull request.
