Early, experimental code implementing 1GB PUD-level THPs in the Linux kernel is showing positive benchmark results, but other upstream stakeholders were surprised by the patch series appearing and it looks like it could be a while until (if ever) the patches are mainlined. The goal of this work is to help reduce translation lookaside buffer (TLB) pressure without resorting to Hugetlbfs.
Usama Arif posted a request for comments (RFC) patch series on 1GB Page Upper Directory (PUD) Transparent Huge Pages (THP) support. The intent of this work is for 1GB PUD-level THPs to allow applications to benefit from lower TLB pressure without going the Hugetlbfs route. The cover letter on the patch series explains:
“While hugetlbfs provides 1GB huge pages today, it has significant limitations that make it unsuitable for many workloads:
1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot or runtime, taking memory away. This requires capacity planning, administrative overhead, and makes workload orchestration much more complex, especially colocating with workloads that don’t use hugetlbfs.
2. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails rather than falling back to smaller pages. This makes it fragile under memory pressure.
3. No Splitting: hugetlbfs pages cannot be split when only partial access is needed, leading to memory waste and preventing partial reclaim.
4. Memory Accounting: hugetlbfs memory is accounted separately and cannot be easily shared with regular memory pools.
PUD THP solves these limitations by integrating 1GB pages into the existing THP infrastructure.”
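To illustrate the difference in programming model, below is a minimal C sketch (not taken from the patch series) contrasting the two routes: a hugetlbfs-backed 1GB mapping, which fails unless 1GB pages were reserved ahead of time (e.g. via the hugepages= boot parameter or the nr_hugepages sysfs knob), and an ordinary anonymous mapping with an MADV_HUGEPAGE hint, which is the path that would benefit from PUD THP without any reservation. It only uses long-standing mmap()/madvise() interfaces.

```c
/*
 * Minimal sketch (not from the patch series) contrasting the two routes to
 * a 1GB-backed region. The hugetlbfs mmap() below will fail unless 1GB
 * pages were reserved ahead of time; the THP path is only a hint that the
 * kernel may or may not honor, with graceful fallback to smaller pages.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

#define SZ_1G (1UL << 30)

int main(void)
{
    /* hugetlbfs route: needs 1GB pages pre-reserved by the administrator. */
    void *h = mmap(NULL, SZ_1G, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    if (h == MAP_FAILED)
        perror("hugetlbfs 1GB mmap");   /* typical when none are reserved */
    else
        munmap(h, SZ_1G);

    /* THP route: ordinary anonymous memory plus a hint; no reservation,
     * and the kernel can fall back to 2MB or 4KB pages as needed. */
    void *t = mmap(NULL, SZ_1G, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (t == MAP_FAILED) {
        perror("anonymous mmap");
        return 1;
    }
    if (madvise(t, SZ_1G, MADV_HUGEPAGE))
        perror("madvise(MADV_HUGEPAGE)");
    memset(t, 1, SZ_1G);                /* touch the region so it faults in */
    munmap(t, SZ_1G);
    return 0;
}
```

On current kernels the madvise path is at most backed by 2MB PMD THPs; with the RFC applied, this same call path is what would let the kernel opportunistically use 1GB PUD mappings while still falling back gracefully when it cannot.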
The benchmarks posted with the patch series look very promising:
“Benchmark results of these patches on Intel Xeon Platinum 8321HC:
Test: True Random Memory Access [1] test of 4GB memory region with pointer chasing workload (4M random pointer dereferences through memory):
| Metric          | PUD THP (1GB) | PMD THP (2MB) | Change      |
|-----------------|---------------|---------------|-------------|
| Memory access   | 88 ms         | 134 ms        | 34% faster  |
| Page fault time | 898 ms        | 331 ms        | 2.7x slower |

Page faulting 1G pages is 2.7x slower (Allocating 1G pages is hard :)). For long-running workloads this will be a one-off cost, and the 34% improvement in access latency provides significant benefit.”
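For a sense of what that workload looks like, here is a rough, hypothetical sketch of a pointer-chasing microbenchmark along the lines described: the 4GB region and 4M dependent dereferences match the figures quoted, but this is not the actual “True Random Memory Access” test referenced in the cover letter, only an illustration of the access pattern that makes TLB reach matter.

```c
/*
 * Hypothetical reconstruction of a pointer-chasing microbenchmark:
 * 4GB working set, 4M dependent random dereferences, per the numbers above.
 * This is NOT the actual test referenced in the cover letter; it only
 * illustrates why TLB reach matters for this kind of workload.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/mman.h>

#define REGION (4UL << 30)                  /* 4GB region */
#define SLOTS  (REGION / sizeof(void *))
#define DEREFS (4UL << 20)                  /* 4M pointer dereferences */

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
    void **slab = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (slab == MAP_FAILED) { perror("mmap"); return 1; }
    madvise(slab, REGION, MADV_HUGEPAGE);   /* ask for THP backing */

    /* Turn the region into one big random cycle of pointers (Sattolo's
     * shuffle) so every load depends on the previous one and defeats
     * hardware prefetching. Setup also pays the page-fault cost. */
    double t0 = now_ms();
    for (size_t i = 0; i < SLOTS; i++)
        slab[i] = &slab[i];
    for (size_t i = SLOTS - 1; i > 0; i--) {
        size_t j = (size_t)(drand48() * i); /* 0 <= j < i */
        void *tmp = slab[i]; slab[i] = slab[j]; slab[j] = tmp;
    }
    printf("setup incl. faults: %.0f ms\n", now_ms() - t0);

    /* The measured part: 4M dependent random-access loads. */
    void **p = (void **)slab[0];
    double t1 = now_ms();
    for (size_t i = 0; i < DEREFS; i++)
        p = (void **)*p;
    printf("%lu dereferences: %.0f ms (ended at %p)\n",
           DEREFS, now_ms() - t1, (void *)p);

    munmap(slab, REGION);
    return 0;
}
```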
While promising, other upstream kernel developers have questioned some elements of it and were caught by surprise by the new patch series. Oracle engineer Lorenzo Stoakes, for example, commented:
“OK so this is somewhat unexpected :)
It would have been nice to discuss it in the THP cabal or at a conference etc. so we could discuss approaches ahead of time. Communication is important, especially with major changes like this.
And PUD THP is especially problematic in that it requires pages that the page allocator can’t give us, presumably you’re doing something with CMA and… it’s a whole kettle of fish.
It’s also complicated by the fact we _already_ support it in the DAX, VFIO cases but it’s kinda a weird sorta special case that we need to keep supporting.
There’s questions about how this will interact with khugepaged, MADV_COLLAPSE, mTHP (and really I want to see Nico’s series land before we really consider this).
So overall, I want to be very cautious and SLOW here. So let’s please not drop the RFC tag until David and I are ok with that?
Also the THP code base is in _dire_ need of rework, and I don’t really want to add major new features without us paying down some technical debt, to be honest.
So let’s proceed with caution, and treat this as a very early bit of experimental code.”
We’ll see where this 1G THP work heads from here over the coming months.
