Longtime AMDGPU driver engineer Alex Deucher posted a set of patches on Wednesday for enhancing the GPU reset experience under Linux with RDNA graphics cards.
The patches improve the GPU reset path for RDNA1 (GFX10) and newer AMD graphics processors. With them, per-queue reset support is enhanced so that ultimately only the process that put the GPU into the bad state and triggered the reset is affected. The code has also been tested with a game running on Linux to confirm that the game can resume after a queue reset.
With this set of 10 patches now under review, Alex Deucher explained:
“This set improves per queue reset support for GC10+. This enables the legacy enforce isolation behavior to serialize access to GC for kernel queues so that only one process uses the queue at a time. When we reset the queue, only that process is effected which improves the user experience when a queue is reset. This mirrors how windows handles per queue resets. Tested on GC 10 and 11 chips with a game running and then running hang tests. The game pauses when the hang happens, then continues after the queue reset.
I tried this same approach and GC8 and 9, but it was not as reliable as soft recovery.”
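To make the isolation idea concrete, here is a toy Python model of why serializing access narrows the blast radius of a queue reset. It is only a sketch under stated assumptions, not the amdgpu implementation: the function names, the `depth` parameter, and the notion that a non-isolated queue holds a mixed window of jobs that are all lost on reset are hypothetical simplifications made for illustration.

```python
from itertools import groupby


def windows(submissions, isolate, depth=4):
    """Split submissions into groups that share the queue at the same time.

    Assumption (toy model): without isolation, up to `depth` consecutive jobs
    from any process are in flight together; with isolation, a group ends
    whenever the owning process changes.
    """
    if isolate:
        # Only one process uses the queue at a time.
        return [list(group) for _, group in groupby(submissions, key=lambda s: s[0])]
    return [submissions[i:i + depth] for i in range(0, len(submissions), depth)]


def affected_by_reset(submissions, isolate, depth=4):
    """Return the set of processes that share a group with a hanging job."""
    hit = set()
    for window in windows(submissions, isolate, depth):
        if any(hangs for _, hangs in window):
            hit.update(proc for proc, _ in window)
    return hit


# (process name, does this job hang?) -- hypothetical workload
submissions = [
    ("game", False),
    ("hang_test", True),
    ("game", False),
    ("compositor", False),
]

print(sorted(affected_by_reset(submissions, isolate=False)))
# ['compositor', 'game', 'hang_test']
print(sorted(affected_by_reset(submissions, isolate=True)))
# ['hang_test']
```

In this model the game shares a window with the hanging job only when isolation is off, which matches the behavior described in the quote: with serialized access, the reset lands on the offending process alone.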
It’s too late for these per-queue reset improvements to be tidied up in time for the upcoming Linux v6.16 merge window, but hopefully they’ll make it into the mainline kernel later in the year to improve the Radeon/RDNA reset experience.