Linux Looks Ready To Introduce "Sheaves" For Opt-In Per-CPU Array-Based Caching Layer

A long-running patch series introduces “sheaves”, an opt-in, per-CPU, array-based caching layer for the SLUB kernel allocator. The sheaves patches look likely to land in the Linux 6.18 kernel if no objections are raised.

SUSE engineer Vlastimil Babka has been leading the effort on sheaves for the SLUB allocator code. He explained in the patch series cover letter posted to the Linux kernel mailing list:

“This series adds an opt-in percpu array-based caching layer to SLUB. It has evolved to a state where kmem caches with sheaves are compatible with all SLUB features (slub_debug, SLUB_TINY, NUMA locality considerations). My hope is therefore that it can eventually be enabled for all kmem caches and replace the cpu (partial) slabs.

Note the name “sheaf” was invented by Matthew Wilcox so we don’t call the arrays magazines like the original Bonwick paper. The per-NUMA-node cache of sheaves is thus called “barn”.”
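To make the terminology concrete, here is a minimal userspace sketch in C of the general idea: a sheaf as a fixed-size array of object pointers, and a barn as a small pool of full sheaves to fall back on. This is a toy model and not the kernel code. All names (`toy_alloc`, `toy_free`, `struct cpu_cache`, `SHEAF_CAPACITY`, `BARN_SLOTS`) are hypothetical, there is no locking or real per-CPU handling, and `malloc()`/`free()` merely stand in for the allocator's slow paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SHEAF_CAPACITY 4 /* hypothetical; chosen small for illustration */
#define BARN_SLOTS     2 /* hypothetical cap on full sheaves kept in the barn */

/* A sheaf: a fixed-size array of object pointers used as a LIFO stack. */
struct sheaf {
    size_t size;
    void *objects[SHEAF_CAPACITY];
};

/* A barn: a pool of full sheaves shared within a node (toy: no locking). */
struct barn {
    size_t nr_full;
    struct sheaf *full[BARN_SLOTS];
};

/* Toy stand-in for the state one CPU would own. */
struct cpu_cache {
    struct sheaf *main;
    struct barn *barn;
    size_t object_size;
    size_t slow_allocs; /* how often we fell back to malloc() */
};

static bool toy_cache_init(struct cpu_cache *c, struct barn *b, size_t object_size)
{
    b->nr_full = 0;
    c->main = calloc(1, sizeof(*c->main));
    c->barn = b;
    c->object_size = object_size;
    c->slow_allocs = 0;
    return c->main != NULL;
}

static void *toy_alloc(struct cpu_cache *c)
{
    /* Main sheaf empty: swap it for a full one from the barn if possible. */
    if (c->main->size == 0 && c->barn->nr_full > 0) {
        free(c->main);
        c->main = c->barn->full[--c->barn->nr_full];
    }
    /* Fast path: pop from the array, no atomic operations involved. */
    if (c->main->size > 0)
        return c->main->objects[--c->main->size];
    /* Slow path stand-in. */
    c->slow_allocs++;
    return malloc(c->object_size);
}

static void toy_free(struct cpu_cache *c, void *obj)
{
    assert(obj != NULL);
    /* Main sheaf full: park it in the barn and start a fresh empty one. */
    if (c->main->size == SHEAF_CAPACITY && c->barn->nr_full < BARN_SLOTS) {
        struct sheaf *fresh = calloc(1, sizeof(*fresh));
        if (fresh) {
            c->barn->full[c->barn->nr_full++] = c->main;
            c->main = fresh;
        }
    }
    /* Fast path: push onto the array. */
    if (c->main->size < SHEAF_CAPACITY) {
        c->main->objects[c->main->size++] = obj;
        return;
    }
    /* Slow path stand-in: no room anywhere, hand the object back. */
    free(obj);
}
```

In this model, an object freed on a CPU is handed straight back by the next allocation on that CPU, and a burst of frees spills whole sheaves into the barn rather than returning objects one at a time.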

The patches introducing sheaves and barns have been queued in the slab/for-next branch of slab.git and also tagged “slab/for-6.18/sheaves”.


Given the sheaves patches are in a “-next” branch and also marked for the Linux 6.18 cycle, they will likely be submitted for the Linux 6.18 merge window in early October.

As for the benefits of sheaves, the cover letter went on to explain:

“The motivation comes mainly from the ongoing work related to VMA locking scalability and the related maple tree operations. This is why VMA and maple nodes caches are sheaf-enabled in the patchset. In v5 I include Liam’s patches for full maple tree conversion that uses the improved preallocation API.

A sheaf-enabled cache has the following expected advantages:

– Cheaper fast paths. For allocations, instead of local double cmpxchg, thanks to local_trylock() it becomes a preempt_disable() and no atomic operations. Same for freeing, which is otherwise a local double cmpxchg only for short term allocations (so the same slab is still active on the same cpu when freeing the object) and a more costly locked double cmpxchg otherwise.

– kfree_rcu() batching and recycling. kfree_rcu() will put objects to a separate percpu sheaf and only submit the whole sheaf to call_rcu() when full. After the grace period, the sheaf can be used for allocations, which is more efficient than freeing and reallocating individual slab objects (even with the batching done by kfree_rcu() implementation itself). In case only some cpus are allowed to handle rcu callbacks, the sheaf can still be made available to other cpus on the same node via the shared barn. The maple_node cache uses kfree_rcu() and thus can benefit from this.

– Preallocation support. A prefilled sheaf can be privately borrowed to perform a short term operation that is not allowed to block in the middle and may need to allocate some objects. If an upper bound (worst case) for the number of allocations is known, but only much fewer allocations actually needed on average, borrowing and returning a sheaf is much more efficient than a bulk allocation for the worst case followed by a bulk free of the many unused objects. Maple tree write operations should benefit from this.

– Compatibility with slub_debug. When slub_debug is enabled for a cache, we simply don’t create the percpu sheaves so that the debugging hooks (at the node partial list slowpaths) are reached as before. The same thing is done for CONFIG_SLUB_TINY. Sheaf preallocation still works by reusing the (ineffective) paths for requests exceeding the cache’s sheaf_capacity. This is in line with the existing approach where debugging bypasses the fast paths and SLUB_TINY [prefers] memory savings over performance.”
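The preallocation point is perhaps the easiest to illustrate. Below is a rough userspace sketch in C, under stated assumptions, of why borrowing a prefilled sheaf can beat a worst-case bulk allocation: unused objects simply stay in the sheaf, so topping it up for the next operation only costs as many backend allocations as were actually consumed. The names (`toy_prefill`, `toy_alloc_from_sheaf`, `toy_backend_allocs`, `TOY_SHEAF_CAPACITY`) are hypothetical and are not the API from the patch series; `malloc()` stands in for the allocator's slower paths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_SHEAF_CAPACITY 8 /* hypothetical worst-case object count */

struct toy_sheaf {
    size_t size;
    void *objects[TOY_SHEAF_CAPACITY];
};

/* Counts trips to the backend allocator (malloc() in this toy). */
static size_t toy_backend_allocs;

/*
 * Top the sheaf up to `count` objects. This is the only step that can
 * fail; it runs before the operation that must not block.
 * On failure, objects already added stay in the sheaf.
 */
static int toy_prefill(struct toy_sheaf *s, size_t count, size_t object_size)
{
    if (count > TOY_SHEAF_CAPACITY)
        return -1;
    while (s->size < count) {
        void *obj = malloc(object_size);
        if (!obj)
            return -1;
        toy_backend_allocs++;
        s->objects[s->size++] = obj;
    }
    return 0;
}

/* Cannot fail while the caller stays within the prefilled count. */
static void *toy_alloc_from_sheaf(struct toy_sheaf *s)
{
    assert(s->size > 0);
    return s->objects[--s->size];
}
```

With a worst case of eight objects and an operation that only ends up needing two, the next prefill in this model performs two backend allocations instead of eight, and nothing has to be bulk-freed in between.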

This should be a nice addition to the Linux kernel for code that is adapted to use the new per-CPU caching layer, starting with the VMA and maple node caches. Here’s to hoping all goes well on its way toward the Linux 6.18 kernel.