Google engineer Roman Gushchin has proposed letting BPF programs customize the Linux kernel's out-of-memory (OOM) behavior.
While user-space out-of-memory killer services such as systemd-oomd already exist, Linux memory management expert Roman Gushchin of Google has proposed allowing (e)BPF programs to customize the OOM behavior within the Linux kernel itself.
Under the patches proposed today, BPF programs could customize the Linux kernel's out-of-memory handling policy and could also invoke OOM handling based on Pressure Stall Information (PSI).
Back in 2023 there was a proposal from a ByteDance engineer to provide some OOM + BPF integration. Roman Gushchin explained with today’s patch series:
“The idea to use bpf for customizing the OOM handling is not new, but unlike the previous proposal, which augmented the existing task ranking policy, this one tries to be as generic as possible and leverage the full power of the modern bpf.
It provides a generic interface which is called before the existing OOM killer code and allows implementing any policy, e.g. picking a victim task or memory cgroup or potentially even releasing memory in other ways, e.g. deleting tmpfs files (the last one might require some additional but relatively simple changes).
The past attempt to implement memory-cgroup aware policy showed that there are multiple opinions on what the best policy is. As it’s highly workload-dependent and specific to a concrete way of organizing workloads, the structure of the cgroup tree etc, a customizable bpf-based implementation is preferable over a in-kernel implementation with a dozen on sysctls.
The second part is related to the fundamental question on when to declare the OOM event. It’s a trade-off between the risk of unnecessary OOM kills and associated work losses and the risk of infinite trashing and effective soft lockups. In the last few years several PSI-based userspace solutions were developed (e.g. OOMd or systemd-OOMd). The common idea was to use userspace daemons to implement custom OOM logic as well as rely on PSI monitoring to avoid stalls. In this scenario the userspace daemon was supposed to handle the majority of OOMs, while the in-kernel OOM killer worked as the last resort measure to guarantee that the system would never deadlock on the memory. But this approach creates additional infrastructure churn: userspace OOM daemon is a separate entity which needs to be deployed, updated, monitored. A completely different pipeline needs to be built to monitor both types of OOM events and collect associated logs. A userspace daemon is more restricted in terms on what data is available to it. Implementing a daemon which can work reliably under a heavy memory pressure in the system is also tricky.”
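For readers unfamiliar with PSI, the kind of check a userspace daemon performs can be sketched roughly as below. This is only an illustration of PSI-based triggering in general, not the BPF interface from the patch series: the function names, the sample values, and the 10% threshold are all hypothetical, and only the `/proc/pressure/memory` text format is taken from the kernel's PSI interface.

```python
from typing import Dict

# Sample text in the /proc/pressure/memory format (values are made up).
SAMPLE_PSI = (
    "some avg10=12.50 avg60=4.20 avg300=1.00 total=123456\n"
    "full avg10=8.75 avg60=2.10 avg300=0.50 total=65432\n"
)


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI text into {"some": {...}, "full": {...}}."""
    result: Dict[str, Dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # Each field looks like "avg10=8.75"; split it into key and value.
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def should_declare_oom(psi: Dict[str, Dict[str, float]],
                       threshold_pct: float = 10.0) -> bool:
    """Hypothetical policy: act once the 10-second 'full' stall average
    reaches threshold_pct. Real daemons use more elaborate rules."""
    return psi["full"]["avg10"] >= threshold_pct


def read_memory_psi(path: str = "/proc/pressure/memory"):
    """Read live PSI data; requires a kernel built with PSI support."""
    with open(path) as f:
        return parse_psi(f.read())


if __name__ == "__main__":
    psi = parse_psi(SAMPLE_PSI)
    print("declare OOM:", should_declare_oom(psi))
```

A lower threshold reacts sooner but risks unnecessary kills, while a higher one risks prolonged stalls, which is the trade-off the quote describes.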
This BPF + OOM approach was originally raised via a Request For Comments (RFC) patch series back in April but has now graduated past that RFC phase.
The v1 patch series is now available for those interested in testing BPF-based out-of-memory customization within the Linux kernel.