A new patch posted to the Linux kernel mailing list aims to address the high wakeup latency seen on modern Intel Xeon server platforms. With Sapphire Rapids and newer, “excessive” wakeup latencies with the Linux menu governor on NOHZ_FULL kernels can hurt Xeon CPUs running latency-sensitive workloads, and a 16-line patch aims to improve the situation. Only one line of actual code is changed; the rest are code comments.
Cloud engineer Ionut Nechita of Wind River has been working to address the high wakeup latency on modern Intel Xeon platforms, an issue that goes back to Sapphire Rapids and persists with the latest-generation Granite Rapids processors. A wakeup latency of around 150us with the menu governor on NOHZ_FULL kernel builds hurts performance for latency-sensitive applications compared to Ice Lake and Skylake Xeons, which see 12 to 21us latency.
An issue was spotted in the menu governor code that causes very deep package C-states to be selected. Those states are too costly on modern Xeon servers due to DDR5 power management overhead, per-tile power gating on modern Xeon CPUs, CXL link restoration, and other complexities of modern servers.
The proposed fix simply adds a 25% safety margin to the menu governor's idle duration prediction to protect against “excessively” deep states without hurting power efficiency much. The margin reduces the risk of selecting too shallow a state while still avoiding unnecessarily deep states.
“When the tick is already stopped and the predicted idle duration is short (< TICK_NSEC), the original code uses next_timer_ns directly. This can be too conservative on platforms with high C-state exit latencies.
On Intel server platforms (2022+), this causes excessive wakeup latencies (~150us) when the actual idle duration is much shorter than next_timer_ns, because the governor selects package C-states (PC6) when shallower states would be more appropriate.
Add a 25% safety margin to the prediction instead of using next_timer_ns directly, while still clamping to next_timer_ns to avoid selecting unnecessarily deep states.
Testing shows this reduces qperf latency from 151us to ~30us on affected platforms while maintaining good power efficiency. Platforms with fast C-state transitions (Ice Lake: 12us, Skylake: 21us) see minimal impact.”
The benchmark results of adding a 25% safety margin to the menu governor speak for themselves:
“Testing on Sapphire Rapids with qperf tcp_lat:
– Before: 151us average latency
– After: ~30us average latency
– Improvement: 5x latency reduction
Testing on Ice Lake and Skylake shows minimal impact:
– Ice Lake: 12us → 12us (no regression)
– Skylake: 21us → 21us (no regression)
Power efficiency testing shows <1% difference in package power consumption during mixed workloads, well within measurement noise.”
On Xeon Sapphire Rapids, a 5x latency reduction is reported. Newer Xeon CPUs should see a similar benefit, but no benchmarks for them were provided as part of this initial patch submission.
Of the 16-line patch, just one line is actual code, with the rest being code comments. The patch is now out for review on the Linux kernel mailing list.
