Profile-guided optimization (PGO) has paid off well for CPU performance: application- and workload-specific profiles are fed back to the compiler so it can make more informed decisions. AMD compiler engineers have been working on device-side PGO for the AMDGPU LLVM back-end so that ROCm/HIP workloads can achieve greater GPU performance. An initial pull request is now open for upstream LLVM.
AMD engineer Sam Liu opened the LLVM pull request for supporting offload profiling, with an initial focus on uniformity-aware optimization in the AMDGPU back-end. The work targets HIP/AMDGPU workloads and profile-guided compiler optimization of GPU kernels.
He explained the work at length in this LLVM Discourse RFC, published minutes ago, to seek feedback from the upstream LLVM developer community.
“This RFC proposes device-side Profile Guided Optimization (PGO) for HIP/AMDGPU, enabling profile-guided compiler optimizations for GPU kernels.
The key contributions are:
– Device PGO infrastructure – instrumentation, profile collection, and consumption pipeline for AMDGPU device code, using only standard HIP APIs (no CLR patches required).
– Uniformity-aware PGO – a safety mechanism that detects whether branches are uniform (all threads take the same path) or divergent at runtime, and gates certain optimizations accordingly.
The uniformity detection is essential because GPU execution follows the SIMT (Single Instruction, Multiple Threads) model, where standard CPU PGO assumptions about “cold” code paths do not hold. Without this safeguard, PGO-guided optimizations like spill placement can cause performance regressions on divergent branches.”
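To make the uniform-versus-divergent distinction concrete, here is a minimal host-side sketch in Python. It is not the RFC's implementation: the `BranchProfile` record, the lane-mask representation, and the "any divergent visit disqualifies the branch" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass

WAVE_SIZE = 64  # assumed wavefront width; some AMD GPUs also run 32-wide waves


@dataclass
class BranchProfile:
    """Hypothetical per-branch record, not the RFC's actual profile format."""
    uniform_visits: int = 0    # wave visits where all active lanes agreed
    divergent_visits: int = 0  # wave visits where active lanes disagreed

    def record(self, active_mask: int, taken_mask: int) -> None:
        """Record one wave reaching the branch, given bitmasks of lanes."""
        taken = taken_mask & active_mask
        # Uniform: either no active lane took the branch, or every one did.
        if taken == 0 or taken == active_mask:
            self.uniform_visits += 1
        else:
            self.divergent_visits += 1

    def is_uniform(self) -> bool:
        # Conservative policy (an assumption): one divergent visit is enough
        # to treat the branch as divergent.
        return self.uniform_visits > 0 and self.divergent_visits == 0


def allow_cold_path_spills(profile: BranchProfile) -> bool:
    """Gate: only trust 'cold path' spill placement on uniform branches."""
    return profile.is_uniform()


full_wave = (1 << WAVE_SIZE) - 1

uniform_branch = BranchProfile()
uniform_branch.record(full_wave, full_wave)  # every lane takes the branch
uniform_branch.record(full_wave, 0)          # no lane takes the branch

divergent_branch = BranchProfile()
divergent_branch.record(full_wave, 0b1010)   # only two lanes take the branch
```

In this sketch the gate permits the optimization for `uniform_branch` and blocks it for `divergent_branch`, which mirrors the idea in the RFC excerpt that a path can look "cold" by raw counts while still being entered by part of a wave.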
The RFC thread goes on to provide an overview of the traditional challenges in applying compiler PGO techniques to GPUs rather than CPUs, different use-cases, HIPRTC for workload-adaptive optimizations, and applying the PGO techniques to static HIP applications. It is a lengthy and technical read for those interested in compiler internals.
Meanwhile this is the LLVM pull request for the initial code:
Key features:
– Wave-aggregated counter increments to reduce atomic contention
– Per-TU contiguous counter allocation to avoid linker reordering issues
– Uniformity detection to identify wave-uniform vs divergent branches
– Uniformity-aware spill placement to prevent PGO regressions on GPUs
The uniformity detection is critical because standard PGO can cause severe performance regressions on GPUs. When PGO moves register spills to “cold” paths, but those paths are entered divergently (different threads take different paths), partial-wave memory accesses cause poor coalescing and up to 3.7x slowdown. By detecting uniformity at profile collection time and gating spill placement decisions, we achieve:
– 12-14% speedup on uniform branches
– No regression on divergent branches (gating prevents the issue)
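The "wave-aggregated counter increments" item in the list above can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not the pull request's code: the function names are hypothetical, a wave is modeled as a 64-bit mask of active lanes, and "atomics" simply counts how many atomic adds would be issued against the shared profile counter.

```python
WAVE_SIZE = 64  # assumed wavefront width


def popcount(mask: int) -> int:
    """Number of set bits, i.e. number of active lanes in the mask."""
    return bin(mask).count("1")


def count_per_lane(active_masks):
    """Baseline: every active lane issues its own atomic add of 1."""
    counter = atomics = 0
    for mask in active_masks:
        for lane in range(WAVE_SIZE):
            if (mask >> lane) & 1:
                counter += 1
                atomics += 1
    return counter, atomics


def count_wave_aggregated(active_masks):
    """Aggregated: one lane per wave adds the wave's active-lane count."""
    counter = atomics = 0
    for mask in active_masks:
        mask &= (1 << WAVE_SIZE) - 1  # ignore bits beyond the wave width
        if mask:                      # a wave with no active lanes adds nothing
            counter += popcount(mask)
            atomics += 1
    return counter, atomics


# Three waves reach the same instrumented block: full, four lanes, none.
masks = [(1 << WAVE_SIZE) - 1, 0b1111, 0]
per_lane_result = count_per_lane(masks)
aggregated_result = count_wave_aggregated(masks)
```

Both strategies arrive at the same counter value, but the aggregated version issues at most one atomic per wave instead of one per active lane, which is where the reduced contention comes from. On real hardware the active-lane count would come from wave-level operations rather than a Python loop.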
The results look promising so far, and it will be exciting to see how this PGO work pans out for AMD ROCm/HIP.
