AMD Squeezing Out More ROCm/HIP Performance With New Device-Side PGO

News Room
Last updated: 2026/01/26 at 3:30 PM
News Room Published 26 January 2026

Compiler profile-guided optimization (PGO) has paid off well on CPUs, where application- and workload-specific profiles are fed back to the compiler so it can make more informed decisions. AMD compiler engineers have been working on device-side PGO for the AMDGPU LLVM back-end so that ROCm/HIP workloads can achieve greater GPU performance. An initial pull request is now open for upstream LLVM.

AMD engineer Sam Liu opened the LLVM pull request to support offload profiling, with an initial focus on uniformity-aware optimization in the AMDGPU back-end. It targets HIP/AMDGPU workloads, enabling profile-guided compiler optimizations of GPU kernels.

He explained the work at length in an LLVM Discourse RFC, published minutes ago, that seeks feedback from the upstream LLVM developer community.

“This RFC proposes device-side Profile Guided Optimization (PGO) for HIP/AMDGPU, enabling profile-guided compiler optimizations for GPU kernels.

The key contributions are:

Device PGO infrastructure – instrumentation, profile collection, and consumption pipeline for AMDGPU device code, using only standard HIP APIs (no CLR patches required).

Uniformity-aware PGO – a safety mechanism that detects whether branches are uniform (all threads take the same path) or divergent at runtime, and gates certain optimizations accordingly.

The uniformity detection is essential because GPU execution follows the SIMT (Single Instruction, Multiple Threads) model, where standard CPU PGO assumptions about “cold” code paths do not hold. Without this safeguard, PGO-guided optimizations like spill placement can cause performance regressions on divergent branches.”
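
To make the uniform/divergent distinction concrete, here is a minimal host-side sketch in C++. It assumes a 64-lane wave modelled as two bit masks; the names `BranchKind` and `classifyBranch` are hypothetical and do not come from the RFC or from LLVM.

```cpp
#include <cstdint>

// Hypothetical model of one 64-lane wave; names are illustrative only.
enum class BranchKind { Uniform, Divergent };

// exec:  bit i is set if lane i is active when the branch executes.
// taken: bit i is set if lane i's branch condition evaluated to true.
inline BranchKind classifyBranch(std::uint64_t exec, std::uint64_t taken) {
    const std::uint64_t activeTaken = exec & taken;
    // Uniform: either no active lane takes the branch, or every active lane does.
    const bool uniform = (activeTaken == 0) || (activeTaken == exec);
    return uniform ? BranchKind::Uniform : BranchKind::Divergent;
}
```

A branch where only some active lanes take the path is the divergent case the RFC warns about, because the wave then runs the "cold" path with a partial set of lanes.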

The RFC thread goes on to provide an overview of the traditional challenges in applying compiler PGO techniques to GPUs rather than CPUs, different use cases, HIPRTC for workload-adaptive optimizations, and applying the PGO techniques to static HIP applications. It is a lengthy and technical read for those interested in compiler internals.


Meanwhile, the LLVM pull request for the initial code summarizes the work as follows:

Key features:

– Wave-aggregated counter increments to reduce atomic contention

– Per-TU contiguous counter allocation to avoid linker reordering issues

– Uniformity detection to identify wave-uniform vs divergent branches

– Uniformity-aware spill placement to prevent PGO regressions on GPUs
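
As a rough illustration of the first item, the sketch below compares per-lane atomic increments with a single aggregated add per wave. It is a host-side C++ model under the assumption of a 64-lane wave described by an active-lane mask; `CounterStats`, `perLaneIncrement`, and `waveAggregatedIncrement` are hypothetical names, and the actual instrumentation runs in device code.

```cpp
#include <atomic>
#include <bitset>
#include <cstdint>

// Hypothetical host-side model; not the instrumentation from the pull request.
struct CounterStats {
    std::uint64_t value = 0;      // final counter value
    std::uint64_t atomicOps = 0;  // number of atomic updates issued
};

// Naive scheme: every active lane issues its own atomic increment.
inline CounterStats perLaneIncrement(std::uint64_t exec) {
    std::atomic<std::uint64_t> counter{0};
    CounterStats s;
    for (int lane = 0; lane < 64; ++lane) {
        if ((exec >> lane) & 1ULL) {
            counter.fetch_add(1, std::memory_order_relaxed);
            ++s.atomicOps;
        }
    }
    s.value = counter.load();
    return s;
}

// Aggregated scheme: one atomic add of the active-lane count per wave.
inline CounterStats waveAggregatedIncrement(std::uint64_t exec) {
    std::atomic<std::uint64_t> counter{0};
    CounterStats s;
    const std::uint64_t active = std::bitset<64>(exec).count();
    if (active != 0) {
        counter.fetch_add(active, std::memory_order_relaxed);
        s.atomicOps = 1;
    }
    s.value = counter.load();
    return s;
}
```

Both functions produce the same counter value, but the aggregated version issues at most one atomic per wave instead of one per active lane, which is where the reduced contention would come from.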

The uniformity detection is critical because standard PGO can cause severe performance regressions on GPUs. When PGO moves register spills to “cold” paths, but those paths are entered divergently (different threads take different paths), partial-wave memory accesses cause poor coalescing and up to 3.7x slowdown. By detecting uniformity at profile collection time and gating spill placement decisions, we achieve:

– 12-14% speedup on uniform branches

– No regression on divergent branches (gating prevents the issue)
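
The gating idea described above can be sketched as a small decision function. This is only an illustration in C++: the `BlockProfile` record, the `maySinkSpillsInto` helper, and the `coldRatio` threshold are all assumptions made for the example, not details of the pull request.

```cpp
#include <cstdint>

// Hypothetical per-block profile record; field names are illustrative.
struct BlockProfile {
    std::uint64_t execCount;   // how often the block ran in the profiled run
    std::uint64_t entryCount;  // how often the enclosing kernel ran
    bool enteredUniformly;     // true if every profiled entry was wave-uniform
};

// Returns true if spill code may be placed in this block.
// coldRatio is an assumed tuning knob, not a value from the RFC.
inline bool maySinkSpillsInto(const BlockProfile& p, double coldRatio = 0.01) {
    if (p.entryCount == 0) return false;  // no profile data: stay conservative
    const bool cold = static_cast<double>(p.execCount) <
                      coldRatio * static_cast<double>(p.entryCount);
    // Gate on uniformity: a "cold" block entered divergently is left alone.
    return cold && p.enteredUniformly;
}
```

With this shape, a rarely executed block is only treated as a spill target when the profile also shows it was entered uniformly; a cold but divergent block falls back to the non-PGO placement.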

The results are promising so far, and it will be exciting to see how this PGO work pans out for AMD ROCm/HIP.
