Too much VRAM and too many Instinct accelerators per server are causing system hibernation to fail on some high-end AMD AI servers running Linux. Eight accelerators, each with 192GB of device memory, can cause hibernation to fail if the server has only 2TB of system RAM. A new patch series posted today aims to fix this in the Linux kernel. A related issue is that the thaw stage can take nearly one hour due to the amount of memory involved.
AMD engineer Samuel Zhang explained that Linux servers can currently run into hibernation failures when there is too much VRAM, because the hibernation process tries to evict that memory to GTT or shared memory. In some situations two copies of the VRAM contents can end up in system RAM, exhausting all of the system memory.
Samuel Zhang explained in today’s patch series addressing the hibernation issue within the Linux kernel:
“Modern data center dGPUs are usually equipped with very large VRAM. On server with such dGPUs(192GB VRAM * 8) and 2TB system memory, hibernate will fail due to no enough free memory.
The root cause is that during hibernation all VRAM memory get evicted to GTT or shmem. In both case, it is in system memory and kernel will try to copy the pages to hibernation image. In the worst case, this causes 2 copies of VRAM memory in system memory, 2TB is not enough for the hibernation image. 192GB * 8 * 2 = 3TB > 2TB.
The fix includes following 2 changes. With 2 changes, there’s much less pages needed to be copied to hibernate image and hibernation can succeed.
1. move GTT to shmem after evicting VRAM. then the GTT pages can be freed.
2. force write shmem pages to swap disk and free shmem pages.
After swapout GTT to shmem in hibernation prepare stage, swapin and restore BOs in thaw stage takes lots of time (50 minutes observed for 8 dGPUs). And it’s not necessary since the follow-up hibernate stages do not use GPU for hibernation successful case. The third patch is just skip the BOs restore in thaw stage to reduce the hibernation time.”
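The worst-case arithmetic in the quoted cover letter can be reproduced in a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, it assumes 1TB = 1024GB (which matches the "3TB" figure in the quote), and it ignores everything else competing for system memory.

```python
GB_PER_TB = 1024  # assumption: binary units, matching 192GB * 8 * 2 = 3TB


def worst_case_footprint_gb(num_gpus, vram_per_gpu_gb, copies=2):
    """System memory needed if all VRAM lands in RAM `copies` times.

    Hypothetical helper: copies=2 models the worst case from the cover
    letter (evicted VRAM in GTT/shmem plus a copy for the image).
    """
    return num_gpus * vram_per_gpu_gb * copies


def fits_in_ram(footprint_gb, system_ram_gb):
    """True if the footprint does not exceed total system RAM."""
    return footprint_gb <= system_ram_gb


footprint = worst_case_footprint_gb(num_gpus=8, vram_per_gpu_gb=192)
system_ram = 2 * GB_PER_TB

print(f"worst case: {footprint}GB ({footprint / GB_PER_TB:.1f}TB)")
print(f"fits in {system_ram}GB of RAM: {fits_in_ram(footprint, system_ram)}")
```

With a single copy the same configuration needs 1536GB, which is under 2TB. That is why avoiding the duplicate copy matters for this class of server.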
Granted, most high-end accelerator-powered AI servers are in constant use, but for those wanting to hibernate them during downtime to reduce power consumption, this is apparently a real problem. Besides exhausting system memory, the other issue is that swapping in and restoring buffer objects in GPU memory during the thaw stage of hibernation can take nearly an hour.
These patches, which touch the Linux power management code as well as the AMDGPU kernel driver, are now under review and will hopefully make it into the mainline kernel in a future cycle.