Late in the Linux 6.16 cycle, and right at the cut-off for queuing new DRM driver feature material for Linux 6.17, an additional drm-misc-next pull request was sent out today with some last-minute kernel graphics driver changes for the next kernel cycle. Motivating this extra pull were the recent AMDGPU system hibernation patches.
The headline change in today’s drm-misc-next pull is the set of AMD patches that reduce system memory requirements for hibernation on large AI/GPU servers. The patches and the underlying issue were previously covered on Phoronix in "AMD Instinct Accelerators With So Much vRAM Have Exposed Linux Hibernation Issues."
With the latest AMD Instinct accelerators exposing 192GB of device memory each and up to eight of them per server, all that device memory is causing problems for the AMDGPU driver during hibernation. In some cases there is not enough free system memory to create the hibernation image, and when hibernation does succeed it takes a long time due to all the archiving and then restoring of the buffer objects.
Besides the possibility of hibernation failing when there is not enough system memory, even when everything does go right it takes an awfully long time:
“For normal hibernation, GPU do not need to be resumed in thaw since it is not involved in writing the hibernation image. Skip resume in this case can reduce the hibernation time.
On VM with 8 * 192GB VRAM dGPUs, 98% VRAM usage and 1.7TB system memory, this can save 50 minutes.”
Nearly an hour can be saved with these patches on a maxed-out AMD Instinct accelerator server.
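For a rough sense of scale, the figures from the quoted test system can be run through a quick calculation. This is only a back-of-the-envelope sketch: it assumes decimal units (1TB = 1000GB) and that all in-use VRAM would need to fit in system memory at once, which is a simplification for illustration and not a description of what the driver actually does.

```python
# Back-of-the-envelope figures for the configuration quoted above.
# Assumptions (not from the patches): decimal units, and that all in-use
# VRAM would have to be backed by system memory at the same time.

GPUS = 8                 # dGPUs in the quoted VM
VRAM_PER_GPU_GB = 192    # VRAM per dGPU
VRAM_USAGE = 0.98        # 98% VRAM usage
SYSTEM_RAM_GB = 1700     # 1.7TB system memory

total_vram_gb = GPUS * VRAM_PER_GPU_GB        # total device memory
used_vram_gb = total_vram_gb * VRAM_USAGE     # device memory in use
share_of_ram = used_vram_gb / SYSTEM_RAM_GB   # fraction of system RAM

print(f"Total VRAM:          {total_vram_gb} GB")
print(f"VRAM in use:         {used_vram_gb:.0f} GB")
print(f"Share of system RAM: {share_of_ram:.0%}")
```

Under those assumptions the in-use VRAM alone comes to roughly 1.5TB, close to 90% of the 1.7TB of system memory, which leaves little headroom for everything else on the system.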
The patches overhauling the AMDGPU hibernation handling are part of today’s drm-misc-next pull request and are what motivated this extra pull. Also contained in this pull request for Linux 6.17 are memory leak fixes in various pieces of code, scheduler improvements for the Nouveau driver, Sitronix ST7567 support, BOE NE14QDM panel support, and other last-minute changes.