Merged today for Mesa 25.3-devel to benefit the RADV Vulkan and RadeonSI Gallium3D AMD drivers are improved scheduling heuristics for the ACO compiler back-end developed by Valve.
The new heuristics should help performance on newer AMD Radeon graphics processors. The previous ACO scheduling heuristics were tuned for aging Polaris GPUs, whereas the code is now better adapted to more recent GPUs.
Daniel Schürmann opened the merge request nearly one year ago to improve the scheduling heuristic for ACO. Finally this morning it made it into Mesa Git.
Schürmann explains in the merge request:
The ACO scheduling heuristic stems from the era of dinosaurs, more precisely the Polaris family, and hasn’t been touched since.
Small introduction: Given the instruction sequence of a shader, the ACO scheduler works by gradually moving up memory load instructions until it cannot find any other independent instructions to move down or until the register pressure exceeds certain predetermined limits (more on that later). Then, it tries to move down the first use of the loaded value, so that the distance between load and use is as high as possible. Of course, there are lots of small corner cases and small adjustments for how multiple memory loads are ordered with each other, but that is the rough idea (and there are no plans to ever change that).
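To make the first phase of that description concrete, here is a toy sketch in Python. It is not ACO code: the `Instr` type, the `hoist_loads` function, and the way pressure is counted (one register per live value, no live-outs) are all assumptions made for illustration. It only models moving loads upward past independent instructions until a dependency or a register-pressure limit stops them; moving the first use downward is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: name, values defined, values used."""
    name: str
    defs: frozenset
    uses: frozenset
    is_load: bool = False


def instr(name, defs=(), uses=(), is_load=False):
    return Instr(name, frozenset(defs), frozenset(uses), is_load)


def independent(a, b):
    """True if a and b can swap places without breaking a data dependency."""
    return not (a.defs & b.uses or b.defs & a.uses or a.defs & b.defs)


def max_pressure(seq):
    """Peak number of simultaneously live values, walking bottom-up."""
    live, peak = set(), 0
    for ins in reversed(seq):
        peak = max(peak, len(live | ins.defs))   # values live right after ins
        live = (live - ins.defs) | ins.uses      # values live right before ins
        peak = max(peak, len(live))
    return peak


def hoist_loads(seq, pressure_limit):
    """Move each load upward while that stays legal and within the limit."""
    seq = list(seq)
    for i in range(len(seq)):
        if not seq[i].is_load:
            continue
        j = i
        while j > 0:
            above = seq[j - 1]
            # Stop at another load (keep load order) or at a dependency.
            if above.is_load or not independent(seq[j], above):
                break
            candidate = seq[:j - 1] + [seq[j], above] + seq[j + 1:]
            # Stop when the swap would exceed the register budget.
            if max_pressure(candidate) > pressure_limit:
                break
            seq = candidate
            j -= 1
    return seq


example = [
    instr("add a", defs={"a"}, uses={"x", "y"}),
    instr("mul b", defs={"b"}, uses={"a"}),
    instr("load t", defs={"t"}, is_load=True),
    instr("add c", defs={"c"}, uses={"t", "b"}),
]

for limit in (2, 3):
    print(limit, [i.name for i in hoist_loads(example, limit)])
```

With a limit of 3 the load reaches the top of the sequence; with a limit of 2 it stops one slot earlier, because hoisting it above `add a` would keep `t`, `x` and `y` live at the same time.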
GPUs are different: On most modern GPUs, the register file is shared between hardware-threads (aka waves or warps), meaning that the more registers some shader uses, the fewer instances of that shader can run in parallel and vice-versa. While using more registers lowers the occupancy, it can also improve the execution time of a single shader and reduce the likelihood of cache thrashing (which means that different shaders evict each other’s data from the cache). So, how many registers should we use?
ACO is unique: On CPUs, the concept of occupancy doesn’t usually exist, which means that existing compilers rarely care either. When developing ACO, much emphasis was put on the ability to predetermine a desired occupancy by being able to schedule within fixed register limits and avoiding additional spilling (keyword: SSA-based register allocation). Currently, when determining the desired occupancy, the only information we take into account is the occupancy of the shader before scheduling happened. We then might allow a lower occupancy to give some room for improved scheduling.
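The trade-off between register use and occupancy can be sketched with simple arithmetic. The Python snippet below is an illustration, not ACO's actual calculation: the function names are hypothetical, the 256-VGPR register file and the cap of 8 waves per SIMD are assumed example values, and real hardware additionally rounds register allocations to a granularity that is ignored here.

```python
def waves_for_demand(vgpr_demand, physical_vgprs=256, max_waves=8):
    """Waves that fit on one SIMD if every wave needs `vgpr_demand` VGPRs.

    `physical_vgprs` and `max_waves` are assumed example values.
    """
    if vgpr_demand <= 0:
        raise ValueError("vgpr_demand must be positive")
    return min(max_waves, physical_vgprs // vgpr_demand)


def vgpr_budget_for_waves(target_waves, physical_vgprs=256):
    """Largest per-wave VGPR count that still allows `target_waves` waves."""
    if target_waves <= 0:
        raise ValueError("target_waves must be positive")
    return physical_vgprs // target_waves


# More registers per wave means fewer waves in flight, and vice versa.
for demand in (24, 64, 128):
    print(demand, "VGPRs ->", waves_for_demand(demand), "waves")

# Fixing the occupancy first yields the register limit to schedule within.
print("4 waves ->", vgpr_budget_for_waves(4), "VGPRs per wave")
```

Under these assumptions, a demand of 64 VGPRs leaves room for 4 waves, and choosing a target of 4 waves up front gives the scheduler a budget of 64 VGPRs per wave.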
This rewrite aims to detangle some concepts and provide more consistent results.
– wave_factor: The purpose of this value is to reflect that RDNA SIMDs can accommodate twice as many waves as GCN SIMDs.
– reg_file_multiple: This value accounts for the larger register file of wave32 and some RDNA3 families.
– wave_minimum: Below this value, we don’t sacrifice any waves. It corresponds to a register demand of 64 VGPRs in wave64.
– occupancy_factor: Depending on target_waves and wave_factor, this controls the scheduling window sizes and number of moves.
The main differences from the previous heuristic are a lower wave minimum and a slightly less aggressive reduction of waves.
It also increases SMEM_MAX_MOVES in order to mitigate some of the changes from targeting fewer waves.
This will hopefully yield at least some minor real-world performance gains for AMD Linux gamers. The impact, though, can vary from game to game.
More details on this big fundamental improvement for the ACO compiler code via this MR. This will be part of Mesa 25.3 due out in Q4 so there still is time for further optimizations to the ACO and RADV driver code.