Crashes on NPUs and AI accelerators are unfortunately a reality and yet another obstacle to worry about in modern computing. Qualcomm developers have sent out patches adding Sub-System Restart (SSR) functionality to the Qualcomm AI Accelerator (QAIC) driver for Linux, so that it can handle restarts when workload crashes occur on their AI accelerator hardware.
Qualcomm open-source developer Youssef Samir explained the SSR functionality for the QAIC driver, which is used by the likes of their Cloud AI 100 and Cloud AI 200 products. He wrote in the patch cover letter:
“SSR is a feature that mitigates a crash in device sub-system. Usually, after a workload (using a sub-system) has crashed on the device, the entire device crashes affecting all the workloads on device. SSR is used to limit the damage only to that particular workload and releases the resources used by it, leaving the decision to the user. Applications are informed when SSR starts and ends via udev notifications. All ongoing requests for that particular workload will be lost.”
The user is in turn responsible for any recovery, with SSR just providing the necessary restart of the hardware and propagating notifications about the hardware state.
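To make that division of labor concrete, here is a minimal Python sketch of how an application might track SSR start and end notifications and decide for itself what to resubmit. It relies only on the general kernel uevent layout (an "ACTION@DEVPATH" header followed by NUL-separated KEY=VALUE pairs). The property names QAIC_SSR_STATE and QAIC_WORKLOAD, their values, and the SsrTracker class are hypothetical placeholders for illustration; the actual event properties are defined by the patches and are not given here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def parse_uevent(payload: bytes) -> dict[str, str]:
    """Parse a raw uevent: 'ACTION@DEVPATH' header, then NUL-separated KEY=VALUE."""
    parts = [p.decode("utf-8", "replace") for p in payload.split(b"\0") if p]
    props: dict[str, str] = {}
    for part in parts[1:]:  # skip the 'ACTION@DEVPATH' header
        key, sep, value = part.partition("=")
        if sep:
            props[key] = value
    return props


@dataclass
class SsrTracker:
    """Tracks which workloads are mid-restart and which lost their requests."""

    restarting: set[str] = field(default_factory=set)
    lost: list[str] = field(default_factory=list)

    def handle(self, props: dict[str, str]) -> None:
        # Hypothetical property names; the real ones come from the driver.
        state = props.get("QAIC_SSR_STATE")
        workload = props.get("QAIC_WORKLOAD")
        if state is None or workload is None:
            return  # not an SSR notification
        if state == "start":
            # Ongoing requests for this workload are lost; remember it
            # so the application can decide whether to resubmit later.
            self.restarting.add(workload)
            self.lost.append(workload)
        elif state == "end":
            self.restarting.discard(workload)

    def usable(self, workload: str) -> bool:
        """True when the workload is not currently in the middle of an SSR."""
        return workload not in self.restarting
```

In a real application the payloads would come from a udev monitor rather than hand-built byte strings, and the recovery policy (resubmit, fail over, or give up) stays entirely with the user, as the cover letter describes.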
These QAIC accelerator driver patches for SSR are now under review for inclusion in a future version of the Linux kernel.