World of Software

AMD Squeezing Out More ROCm/HIP Performance With New Device-Side PGO

News Room
Last updated: 2026/01/26 at 3:30 PM
News Room Published 26 January 2026

Compiler profile-guided optimization (PGO) has paid off well for CPU performance: application- or workload-specific profiles are fed back to the compiler so it can make better-informed decisions. AMD compiler engineers have been working on device-side PGO for their AMDGPU LLVM back-end so that ROCm/HIP workloads can achieve greater GPU performance. An initial pull request is now open for upstream LLVM.

AMD engineer Sam Liu opened the LLVM pull request, which adds offload profiling support with an initial focus on a uniformity-aware optimization in the AMDGPU back-end. It targets HIP/AMDGPU workloads, enabling profile-guided compiler optimizations of GPU kernels.

He explained the work at length in an LLVM Discourse RFC, published minutes ago, that seeks feedback from the upstream LLVM developer community.

“This RFC proposes device-side Profile Guided Optimization (PGO) for HIP/AMDGPU, enabling profile-guided compiler optimizations for GPU kernels.

The key contributions are:

Device PGO infrastructure – instrumentation, profile collection, and consumption pipeline for AMDGPU device code, using only standard HIP APIs (no CLR patches required).

Uniformity-aware PGO – a safety mechanism that detects whether branches are uniform (all threads take the same path) or divergent at runtime, and gates certain optimizations accordingly.

The uniformity detection is essential because GPU execution follows the SIMT (Single Instruction, Multiple Threads) model, where standard CPU PGO assumptions about “cold” code paths do not hold. Without this safeguard, PGO-guided optimizations like spill placement can cause performance regressions on divergent branches.”
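
To make the uniform versus divergent distinction concrete, here is a minimal Python sketch that simulates it on the host. It is not AMD's implementation: the wave size of 64, the per-lane boolean representation, and the names `classify_wave` and `branch_is_uniform` are all assumptions made for illustration.

```python
from typing import List, Sequence

WAVE_SIZE = 64  # assumption: 64 lanes per wave for this illustration


def classify_wave(taken: Sequence[bool]) -> str:
    """Classify one wave's outcome at a branch.

    `taken[i]` is True when active lane i took the branch.
    """
    if not taken:
        raise ValueError("a wave needs at least one active lane")
    if all(taken) or not any(taken):
        return "uniform"    # every active lane went the same way
    return "divergent"      # lanes split, so the wave has to run both paths


def branch_is_uniform(waves: List[Sequence[bool]]) -> bool:
    """Hypothetical profile-time check: uniform only if no wave ever diverged."""
    return all(classify_wave(w) == "uniform" for w in waves)


# A branch on a per-wave constant: all lanes agree within each wave.
uniform_run = [[True] * WAVE_SIZE, [False] * WAVE_SIZE]
# A branch on the lane index: even lanes take it, odd lanes do not.
divergent_run = [[lane % 2 == 0 for lane in range(WAVE_SIZE)]]

print(branch_is_uniform(uniform_run))    # True
print(branch_is_uniform(divergent_run))  # False
```

Note that the first run counts as uniform even though the two waves went different ways: in this sketch uniformity is judged within a wave, not across the whole kernel launch.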

The RFC thread goes on to cover the traditional challenges of applying compiler PGO techniques to GPUs rather than CPUs, the different use cases, HIPRTC for workload-adaptive optimizations, and applying PGO to static HIP applications. It is a lengthy and technical read for those interested in compiler internals.


Meanwhile, the LLVM pull request with the initial code describes the work as follows:

Key features:

– Wave-aggregated counter increments to reduce atomic contention

– Per-TU contiguous counter allocation to avoid linker reordering issues

– Uniformity detection to identify wave-uniform vs divergent branches

– Uniformity-aware spill placement to prevent PGO regressions on GPUs
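
The first item in the list above, wave-aggregated counter increments, can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions: each wave is modeled as an integer bitmask of active lanes, and the function names are hypothetical rather than taken from the pull request. The idea shown is that one lane adds the wave's active-lane count in a single atomic operation instead of every lane adding 1.

```python
from typing import List, Tuple


def per_lane_increments(waves: List[int]) -> Tuple[int, int]:
    """Baseline: every active lane issues its own atomic add of 1.

    Each wave is an integer bitmask of its active lanes.
    Returns (final counter value, number of atomic operations).
    """
    counter = 0
    atomics = 0
    for mask in waves:
        for _ in range(bin(mask).count("1")):
            counter += 1
            atomics += 1
    return counter, atomics


def wave_aggregated_increments(waves: List[int]) -> Tuple[int, int]:
    """Aggregated: one atomic add per wave, carrying the active-lane count."""
    counter = 0
    atomics = 0
    for mask in waves:
        active = bin(mask).count("1")  # population count of the lane mask
        if active:
            counter += active  # a single elected lane adds the whole count
            atomics += 1
    return counter, atomics


full_wave = (1 << 64) - 1           # assumption: 64 lanes, all active
waves = [full_wave, full_wave, 0b1111, 0]

print(per_lane_increments(waves))         # (132, 132)
print(wave_aggregated_increments(waves))  # (132, 3)
```

Both variants record the same count, but the aggregated one touches the shared counter far less often, which is where the reduced atomic contention would come from.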

The uniformity detection is critical because standard PGO can cause severe performance regressions on GPUs. When PGO moves register spills to “cold” paths, but those paths are entered divergently (different threads take different paths), partial-wave memory accesses cause poor coalescing and up to 3.7x slowdown. By detecting uniformity at profile collection time and gating spill placement decisions, we achieve:

– 12-14% speedup on uniform branches

– No regression on divergent branches (gating prevents the issue)
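
The gating described above can be sketched as a simple decision function. Everything here is an assumption for illustration, including the `BranchProfile` record, the 5% coldness threshold, and the name `may_sink_spills`; the actual heuristics in the LLVM patch may differ considerably.

```python
from dataclasses import dataclass


@dataclass
class BranchProfile:
    """Hypothetical per-branch profile record."""
    taken_count: int   # how often the path was entered
    total_count: int   # how often the branch was reached
    uniform: bool      # True if no wave diverged at this branch while profiling


COLD_THRESHOLD = 0.05  # assumption: a path taken under 5% of the time is "cold"


def may_sink_spills(profile: BranchProfile) -> bool:
    """Allow moving register spills onto a path only if it is cold AND uniform."""
    if profile.total_count == 0:
        return False  # no profile data, so stay conservative
    is_cold = profile.taken_count / profile.total_count < COLD_THRESHOLD
    return is_cold and profile.uniform


print(may_sink_spills(BranchProfile(10, 1000, uniform=True)))    # True
print(may_sink_spills(BranchProfile(10, 1000, uniform=False)))   # False
print(may_sink_spills(BranchProfile(900, 1000, uniform=True)))   # False
```

The second case is the one standard CPU-style PGO would get wrong: the path looks cold by count, but because it is entered divergently the sketch refuses to place spills there.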

The results are promising so far, and it will be exciting to see how this PGO work pans out for AMD ROCm/HIP.
