One of the core Linux infrastructure improvements that AMD engineers have been working on recently is pghot, a hot-page tracking and promotion subsystem. This proposed addition to the Linux kernel could be especially beneficial for those running modern AMD EPYC servers with CXL and multiple memory tiers.
AMD engineer Bharata Rao today posted the latest request for comments (RFC) patches implementing pghot for the Linux kernel. This hot-page tracking infrastructure aims to unify hot-page detection from multiple sources, centralize the hot-page promotion logic, and consolidate the separate parts of the Linux kernel that currently track page accesses independently.
The patch cover letter describes how pghot works:
“- Tracks frequency and last access time.
- Additionally, the accessing NUMA node ID (NID) for each recorded access is also tracked in the precision mode.
- These hotness parameters are maintained in a per-PFN hotness record within the existing mem_section data structure.
- In default mode, one byte (u8) is used for hotness record. 5 bits are used to store time and bucketing scheme is used to represent a total access time up to 4s with HZ=1000. Default toptier NID (0) is used as the target for promotion which can be changed via debugfs tunable.
- In precision mode, 4 bytes (u32) are used for each hotness record. 14 bits are used to store time which can represent around 16s with HZ=1000.
- Classifies pages as hot based on configurable thresholds.
- Pages classified as hot are marked as ready for migration using the ready bit. Both modes use MSB of the hotness record as ready bit.
- Per-lower-tier-node kmigrated threads periodically scan the PFNs of lower-tier nodes, checking for the migration-ready bit to perform batched migrations. Interval between successive scans and batching value are configurable via debugfs tunables.”
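To make the default-mode record more concrete, here is a rough Python model of a one-byte hotness record and a batched scan for the ready bit. It is only a sketch and not the patch code: the cover letter confirms the u8 size, the 5 time bits, and the MSB ready bit, while the placement of the time bits, the use of the remaining 2 bits as a frequency counter, the 128-jiffy bucket size, the frequency-only threshold, and every function name below are assumptions made for illustration.

```python
HZ = 1000                       # jiffies per second, as in the cover letter's example
READY_BIT = 1 << 7              # MSB of the u8 record: page is ready for migration
TIME_BITS = 5                   # default mode stores the access time in 5 bits
TIME_MASK = (1 << TIME_BITS) - 1
FREQ_SHIFT = TIME_BITS          # assumption: the 2 leftover bits count accesses
FREQ_MASK = 0x3
BUCKET_SHIFT = 7                # assumption: 128 jiffies per bucket (32 buckets ~ 4s)
PRECISION_TIME_BITS = 14        # precision mode stores the time in 14 bits


def time_to_bucket(jiffies):
    """Compress a jiffies timestamp into a 5-bit bucket index."""
    return (jiffies >> BUCKET_SHIFT) & TIME_MASK


def record_access(rec, now_jiffies, freq_threshold=2):
    """Return the updated u8 record after one access (hypothetical helper).

    The frequency counter saturates at 3; once it reaches the threshold,
    the ready bit is set. The real classification is more involved.
    """
    freq = min(((rec >> FREQ_SHIFT) & FREQ_MASK) + 1, FREQ_MASK)
    rec = (rec & READY_BIT) | (freq << FREQ_SHIFT) | time_to_bucket(now_jiffies)
    if freq >= freq_threshold:
        rec |= READY_BIT
    return rec & 0xFF


def scan_ready(records, batch_size):
    """Collect up to batch_size PFNs whose ready bit is set, clearing each record."""
    batch = []
    for pfn, rec in enumerate(records):
        if rec & READY_BIT:
            batch.append(pfn)
            records[pfn] = 0    # reset the record once the page is queued
            if len(batch) == batch_size:
                break
    return batch


# Time ranges implied by the bit widths at HZ=1000.
default_range_s = ((1 << TIME_BITS) << BUCKET_SHIFT) / HZ
precision_range_s = (1 << PRECISION_TIME_BITS) / HZ
```

Under these assumptions, the 32 buckets of 128 jiffies cover roughly the 4s quoted for default mode, and 14 bits of raw jiffies cover roughly the 16s quoted for precision mode. The scan function mirrors the idea of kmigrated walking lower-tier PFNs and migrating ready pages in batches, with `batch_size` standing in for the debugfs batching tunable.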
The pghot patches are out on the Linux kernel mailing list.
For those just wanting the net result of pghot: in testing on an AMD EPYC Zen 5 server with two CPU NUMA nodes and a CXL node, benchmarks did show time savings, both in pure page-promotion scenarios and when the top-tier memory is over-committed, leading to a mix of page promotion and demotion.
