Last month I wrote about new code slated for Linux 6.15 that provides a standardized, cross-driver means of reporting hung GPUs to user-space. Initially for the likes of the AMD and Intel graphics drivers, user-space will be notified via this new wedged event when a GPU is hung, in case it wants to take additional actions to try to recover the GPU or at least properly note its troubled state. There are now proposed patches under review for further extending this functionality.
The original use-case for this standardized reporting of hung GPUs was so that additional steps could be taken to address unresponsive hardware, such as having user-space unbind and rebind the kernel driver or reset the bus device. But moving forward this new GPU wedged event may become useful for more than just recovery.
KDE KWin developer Xaver Hugl has begun experimenting with this new functionality and already has a draft KWin pull request for switching the renderer for Wayland sessions at run-time and for better handling of "severe" GPU resets. He commented:
“I experimented with using this in KWin, and [the proposed code] makes it fall back to a software renderer when a rebind is required to recover the GPU. Making it also survive the rebind properly is more challenging (current version of the MR doesn’t do it for you and just crashes if you do it with a udev rule or manually), but it’s doable – and not a problem of the API.
I’d really like to have the PID of the client that triggered the GPU reset, so that we can kill it if multiple resets are triggered in a row (or switch to software rendering if it’s KWin itself) and show a user-friendly notification about why their app(s) crashed, but that can be added later.”
André Almeida with consulting firm Igalia took that idea and posted a patch-set providing that application information for the GPU wedged events. The patch series proposes this "app info" addition so that Wayland compositors can display user-friendly information about problematic apps that may be hanging the GPU, or block those apps outright.
The app info option on wedged events would report the PID of the offending process as well as the application name, in case the process is already dead. User-space can then use this to present the information in a user-friendly way, or at least to log it better, when hung GPUs are encountered.
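As a rough illustration of how user-space might consume such an event, here is a minimal Python sketch that parses the key/value payload of a wedged uevent. The `WEDGED=` key with a comma-separated list of recovery methods follows the kernel's documented format for these events; the `PID` and `TASK` key names are assumptions about the proposed patches and may differ in whatever is eventually merged. The `WedgedEvent` class and `parse_wedged_event` function are hypothetical names used only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class WedgedEvent:
    """Parsed view of a GPU wedged uevent (hypothetical helper type)."""
    methods: List[str] = field(default_factory=list)  # e.g. ["rebind", "bus-reset"]
    pid: Optional[int] = None   # assumed "PID" key from the proposed app info
    task: Optional[str] = None  # assumed "TASK" key from the proposed app info


def parse_wedged_event(payload: str) -> Optional[WedgedEvent]:
    """Parse newline-separated KEY=VALUE uevent properties.

    Returns None if the payload carries no WEDGED property.
    """
    props: Dict[str, str] = {}
    for line in payload.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that are not KEY=VALUE
            props[key.strip()] = value.strip()

    if "WEDGED" not in props:
        return None

    pid_text = props.get("PID", "")
    return WedgedEvent(
        methods=[m for m in props["WEDGED"].split(",") if m],
        # The app info is optional, so tolerate a missing or malformed PID.
        pid=int(pid_text) if pid_text.isdigit() else None,
        task=props.get("TASK") or None,
    )


if __name__ == "__main__":
    sample = "WEDGED=rebind,bus-reset\nPID=1234\nTASK=glxgears"
    print(parse_wedged_event(sample))
```

A compositor or logging daemon could feed the properties it receives from udev into a parser like this and then decide whether to notify the user, fall back to software rendering, or attempt one of the listed recovery methods.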
This looks like it will be a useful addition and hopefully it manages to make it across the finish line (upstream kernel) for further enhancing the wedged GPU events.