In addition to the proposed Hierarchical Queued NUMA-aware spinlocks for better performance, another interesting performance-enhancing patch series posted for the Linux kernel in the past 24 hours aims to improve the performance of single-threaded tasks running on high core count desktops, workstations, and servers.
Gabriel Krisman Bertazi of SUSE posted the request for comments (RFC) patch series to improve the performance of single-threaded tasks on today’s many-core CPUs. The optimization centers on the Linux kernel’s “rss_stat” structure, which holds the Resident Set Size (RSS) statistics for a process, that is, the amount of memory it has in use.
Gabriel Krisman Bertazi explained how this rss_stat optimization speeds up initialization and teardown for single-threaded tasks:
“The cost of the pcpu memory allocation is non-negligible for systems with many cpus, and it is quite visible when forking a new task, as reported in a few occasions. In particular, Jan Kara reported the commit introducing per-cpu counters for rss_stat caused a 10% regression of system time for gitsource in his system. In that same occasion, Jan suggested we special-cased the single-threaded case: since we know there won’t be frequent remote updates of rss_stats for single-threaded applications, we could special case it with a local counter for most updates, and an atomic counter for the infrequent remote updates. This patchset implements this idea.”
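To make the quoted idea more concrete, here is a minimal userspace sketch in C11 of a counter that pairs a plain local value for the owning thread with an atomic value for infrequent remote updates. It is only an illustration under stated assumptions: the `hybrid_counter` type and its helper functions are hypothetical names invented for this example, not the kernel's rss_stat layout or the API used by the patch series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical userspace model of a "local + atomic" counter.
 * This is NOT the kernel's rss_stat structure; all names are made up.
 */
struct hybrid_counter {
    long local;          /* touched only by the owning thread: no atomics */
    atomic_long remote;  /* touched by other threads, expected to be rare */
};

static inline void hybrid_counter_init(struct hybrid_counter *c)
{
    assert(c != NULL);
    c->local = 0;
    atomic_init(&c->remote, 0);
}

/* Fast path: the single owning thread updates a plain long. */
static inline void hybrid_counter_add_local(struct hybrid_counter *c, long delta)
{
    c->local += delta;
}

/* Slow path: any other thread pays for an atomic read-modify-write. */
static inline void hybrid_counter_add_remote(struct hybrid_counter *c, long delta)
{
    atomic_fetch_add_explicit(&c->remote, delta, memory_order_relaxed);
}

/*
 * Sum both parts. In this sketch only the owning thread may call this,
 * because the plain `local` field is not safe to read concurrently.
 */
static inline long hybrid_counter_read(struct hybrid_counter *c)
{
    return c->local + atomic_load_explicit(&c->remote, memory_order_relaxed);
}
```

The appeal of this shape is that setting up such a counter is a couple of stores, whereas a per-CPU counter needs an allocation that scales with the number of CPUs. That allocation is the cost the quote describes as visible at fork time.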
The end result is some nice performance gains for single-threaded tasks running on high core count Linux systems: a 6~15% improvement in a synthetic benchmark and around 1.5% better performance in a more realistic one. That is still enough to make pursuing it worthwhile:
“On a 256c system, where the pcpu allocation of the rss_stats is quite noticeable, this has reduced the wall-clock time between 6% – 15% (depending on the number of cores) of an artificial fork-intensive microbenchmark (calling /bin/true in a loop). In a more realistic benchmark, it showed an improvement of 1.5% on kernbench elapsed time.”
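For readers who want a feel for the kind of fork-intensive microbenchmark described in the quote, the following C sketch forks, executes a program such as /bin/true, and waits for it in a loop while measuring wall-clock time. This is an assumption about what such a test could look like, not the author's actual benchmark; the function name `time_fork_exec_loop` and the choice of iteration count are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork + exec `path` `iterations` times, waiting for
 * each child, and return the elapsed wall-clock time in seconds.
 * Returns -1.0 if any step fails or a child exits with a non-zero status.
 */
static inline double time_fork_exec_loop(const char *path, int iterations)
{
    struct timespec start, end;

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;

    for (int i = 0; i < iterations; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1.0;

        if (pid == 0) {
            /* Child: replace ourselves with the target program. */
            execl(path, path, (char *)NULL);
            _exit(127); /* only reached if exec failed */
        }

        /* Parent: wait so that task teardown is included in the timing. */
        int status;
        if (waitpid(pid, &status, 0) < 0)
            return -1.0;
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            return -1.0;
    }

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling `time_fork_exec_loop("/bin/true", 10000)` from a small `main()` on kernels with and without the patches would give a rough comparison of the task setup and teardown cost, though results will vary with core count and system load.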
Those interested in learning more can do so via this RFC patch series. It will be fun to benchmark these patches on EPYC and Threadripper systems if they look like they’ll end up in mainline.
