Intel engineer Chen Yu has posted a fresh round of Linux kernel patches working on cache-aware scheduling/load-balancing, functionality being sought after by both Intel and AMD. The new patches should address some performance regressions observed with the prior patches.
For the past several months, patches have been floated on the Linux kernel mailing list for cache-aware load balancing that have the potential to help performance on modern AMD and Intel processors, especially larger server processors. The focus of cache-aware scheduling is aggregating tasks that likely share resources into the same cache domain for better cache locality, especially on recent Intel Xeon and AMD EPYC processors.
Early benchmark results have shown promising potential for cache-aware load balancing / scheduling, while the v4 patches posted this weekend address some performance regressions that turned up.
Chen Yu explained with the new v4 patch series:
“The main change in v4 is to address the performance regressions reported in v3, which are caused by over-aggregation of tasks in a single LLC, but they don’t actually share data. Such aggregation could cause regression on platforms with smaller LLC size. It can also occur when running workloads with a large memory footprint (e.g., stream) or when workloads involve too many threads (e.g., hackbench).
Patches 1 to 20 are almost identical to those in v3; the key fixes are included in patches 21 to 28. The approach involves tracking a process’s resident pages and comparing this to the LLC cache size.”
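That decision is made inside the scheduler, but the heuristic is easy to picture from userspace. Below is a minimal C sketch, not taken from the patch series, that reads a process's VmRSS from /proc and an LLC size from the sysfs cache topology and makes the same kind of RSS-versus-LLC comparison; treating index3 as the LLC and using PID 1 as the target process are assumptions for illustration only.

```c
#include <stdio.h>
#include <stdbool.h>

/* Read VmRSS (in kB) from /proc/<pid>/status; returns -1 on failure. */
static long read_rss_kb(int pid)
{
    char path[64], line[256];
    long rss_kb = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/status", pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "VmRSS: %ld kB", &rss_kb) == 1)
            break;
    }
    fclose(f);
    return rss_kb;
}

int main(void)
{
    long llc_kb = 0;
    /* LLC size for CPU0 from the sysfs cache topology, e.g. "32768K";
       assuming index3 is the last-level cache, as on typical x86 parts. */
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cache/index3/size", "r");
    if (f) {
        fscanf(f, "%ldK", &llc_kb);
        fclose(f);
    }

    long rss_kb = read_rss_kb(1); /* PID 1 purely as an example target */
    bool fits = rss_kb >= 0 && llc_kb > 0 && rss_kb <= llc_kb;

    printf("RSS %ld kB vs LLC %ld kB -> %s\n", rss_kb, llc_kb,
           fits ? "aggregating its threads into one LLC may help"
                : "RSS exceeds the LLC, skip aggregation");
    return 0;
}
```

The actual patches track a process's resident pages from within the scheduler rather than polling /proc, but the comparison against the LLC size is the same basic idea.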
There is also a new knob, /sys/kernel/debug/sched/sched_cache_ignore_rss, for turning cache-aware scheduling off or making it more aggressive (a brief usage sketch follows the quoted description):
“To that effect, /sys/kernel/debug/sched/sched_cache_ignore_rss is added where
0 turns off cache aware scheduling entirely
1 turns off cache aware scheduling when resident memory of the process exceeds the LLC size (default)
100 RSS will not be taken into account during cache aware scheduling
N translates to turning off cache aware scheduling when RSS is greater than (N-1) * 256 * LLC size

So for folks who want cache aware scheduling to be aggressive and know their process threads share lots of data, they could set it to 100.
Similarly, the number of active threads within each process is monitored and compared to the number of cores (excluding SMT) in each LLC.”
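To make those values concrete, here is a small hedged C sketch assuming the v4 patches are applied and debugfs is mounted at /sys/kernel/debug: it prints the RSS cutoff a given N implies per the quoted (N-1) * 256 * LLC size formula and then writes 100 to the knob so that RSS is ignored entirely. The LLC size and N used here are placeholders, and writing the knob requires root.

```c
#include <stdio.h>

#define KNOB "/sys/kernel/debug/sched/sched_cache_ignore_rss"

int main(void)
{
    long llc_kb = 32768; /* placeholder LLC size in kB; read it from sysfs in practice */
    long n = 4;          /* placeholder knob value N */

    /* Per the quoted semantics, a value N disables cache-aware scheduling
       once a process's RSS exceeds (N - 1) * 256 * LLC size. */
    printf("N=%ld would cut off cache-aware scheduling at %ld kB of RSS\n",
           n, (n - 1) * 256 * llc_kb);

    /* Writing 100 makes RSS be ignored entirely, the most aggressive
       setting described above. Requires root, the v4 patches applied,
       and debugfs mounted at /sys/kernel/debug. */
    FILE *f = fopen(KNOB, "w");
    if (!f) {
        perror("open " KNOB);
        return 1;
    }
    fprintf(f, "100\n");
    fclose(f);
    return 0;
}
```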
We’ll see how the v4 patch testing and review goes. Hopefully it won’t be too long before cache-aware scheduling is upstreamed into the mainline Linux kernel.