AMD Squeezing Out More ROCm/HIP Performance With New Device-Side PGO

Last updated: 2026/01/26 at 3:30 PM
News Room Published 26 January 2026

Profile-guided optimization (PGO) has paid off well for CPU performance: application- or workload-specific profiles are fed back to the compiler so that it can make better-informed decisions. AMD compiler engineers have been working on device-side PGO for the AMDGPU LLVM back-end so that ROCm/HIP workloads can achieve greater GPU performance. An initial pull request is now open for upstream LLVM.

AMD engineer Sam Liu opened the LLVM pull request, which adds offload profiling support with an initial focus on a uniformity-aware optimization in the AMDGPU back-end. It targets HIP/AMDGPU workloads, enabling profile-guided compiler optimizations of GPU kernels.

He explained the work at length in an LLVM Discourse RFC, published minutes ago, that seeks feedback from the upstream LLVM developer community.

“This RFC proposes device-side Profile Guided Optimization (PGO) for HIP/AMDGPU, enabling profile-guided compiler optimizations for GPU kernels.

The key contributions are:

Device PGO infrastructure – instrumentation, profile collection, and consumption pipeline for AMDGPU device code, using only standard HIP APIs (no CLR patches required).

Uniformity-aware PGO – a safety mechanism that detects whether branches are uniform (all threads take the same path) or divergent at runtime, and gates certain optimizations accordingly.

The uniformity detection is essential because GPU execution follows the SIMT (Single Instruction, Multiple Threads) model, where standard CPU PGO assumptions about “cold” code paths do not hold. Without this safeguard, PGO-guided optimizations like spill placement can cause performance regressions on divergent branches.”
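To make the uniform-versus-divergent distinction concrete, here is a minimal CPU-side model in Python. It is only a sketch of the idea described in the RFC excerpt, not the instrumentation AMD actually emits: the `BranchProfile` class, the `record_branch` helper, and the 64-lane wave width are all assumptions made for illustration.

```python
from dataclasses import dataclass

WAVE_SIZE = 64  # assumed wave width for this model (hypothetical)


@dataclass
class BranchProfile:
    """Hypothetical per-branch record; not the RFC's actual profile format."""
    executions: int = 0  # number of times a wave reached the branch
    divergent: int = 0   # number of times lanes within one wave disagreed

    @property
    def is_uniform(self) -> bool:
        # Treat a branch as uniform only if it ran and never diverged.
        return self.executions > 0 and self.divergent == 0


def record_branch(profile: BranchProfile, active_mask: int, taken_mask: int) -> None:
    """Record one wave reaching a branch.

    active_mask: bit i is set if lane i is active.
    taken_mask:  bit i is set if lane i evaluates the condition as true.
    """
    taken = taken_mask & active_mask
    profile.executions += 1
    # Uniform means either no active lane or every active lane takes the branch.
    if taken != 0 and taken != active_mask:
        profile.divergent += 1


full = (1 << WAVE_SIZE) - 1

uniform = BranchProfile()
record_branch(uniform, full, full)  # every lane takes the branch
record_branch(uniform, full, 0)     # no lane takes the branch

divergent = BranchProfile()
record_branch(divergent, full, 0b1010)  # only lanes 1 and 3 take it

print(uniform.is_uniform, divergent.is_uniform)  # True False
```

Note that a branch can be uniform even when different waves go different ways; what matters in this model is that the lanes inside any single wave agree.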

The RFC thread goes on to give an overview of the traditional challenges in applying compiler PGO techniques to GPUs rather than CPUs, the different use cases, the use of HIPRTC for workload-adaptive optimizations, and how the PGO techniques apply to static HIP applications. It is a lengthy and technical read for those interested in compiler internals.

AMD HIP PGO RFC

Meanwhile, the LLVM pull request for the initial code summarizes the implementation:

Key features:

– Wave-aggregated counter increments to reduce atomic contention

– Per-TU contiguous counter allocation to avoid linker reordering issues

– Uniformity detection to identify wave-uniform vs divergent branches

– Uniformity-aware spill placement to prevent PGO regressions on GPUs
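The first item in the list, wave-aggregated counter increments, can be illustrated with a small Python model. This is a sketch under stated assumptions rather than the pull request's code: the two function names are hypothetical, and a wave is modeled simply as a bit mask of active lanes. The point is that both schemes produce the same counter value while the aggregated one issues far fewer atomic operations.

```python
def per_lane_increments(wave_masks):
    """Baseline model: every active lane issues its own atomic add of 1."""
    counter = 0
    atomics = 0
    for mask in wave_masks:
        lanes = bin(mask).count("1")  # number of active lanes in this wave
        counter += lanes
        atomics += lanes              # one atomic per active lane
    return counter, atomics


def wave_aggregated_increments(wave_masks):
    """Aggregated model: one lane per wave adds the active-lane count."""
    counter = 0
    atomics = 0
    for mask in wave_masks:
        lanes = bin(mask).count("1")
        if lanes:
            counter += lanes
            atomics += 1              # a single atomic for the whole wave
    return counter, atomics


# Three waves: fully active (64 lanes), four active lanes, and idle.
masks = [0xFFFFFFFFFFFFFFFF, 0xF, 0x0]
print(per_lane_increments(masks))         # (68, 68)
print(wave_aggregated_increments(masks))  # (68, 2)
```

In this model the profile counts are identical, but contention on the shared counter drops from one atomic per lane to at most one per wave.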

The uniformity detection is critical because standard PGO can cause severe performance regressions on GPUs. When PGO moves register spills to “cold” paths, but those paths are entered divergently (different threads take different paths), partial-wave memory accesses cause poor coalescing and up to 3.7x slowdown. By detecting uniformity at profile collection time and gating spill placement decisions, we achieve:

– 12-14% speedup on uniform branches

– No regression on divergent branches (gating prevents the issue)
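The gating step described above can be sketched as a simple decision function. This is an illustrative model only: the function name, its parameters, and the 5% `cold_ratio` threshold are assumptions and do not come from the pull request. It shows the shape of the rule, namely that a path is considered for spill code only when it is both rarely executed and reached uniformly.

```python
def cold_path_for_spills(taken, not_taken, is_uniform, cold_ratio=0.05):
    """Return the branch side that may receive spill code, or None.

    taken / not_taken: profile counts for the two sides of a branch.
    is_uniform:        True if the profile never saw lanes of a wave disagree.
    cold_ratio:        hypothetical threshold below which a side counts as cold.
    """
    total = taken + not_taken
    if total == 0:
        return None  # no profile data, so keep the default placement
    if not is_uniform:
        return None  # gate: a divergent "cold" path is not really cold on a GPU
    side, count = ("taken", taken) if taken <= not_taken else ("not_taken", not_taken)
    return side if count / total <= cold_ratio else None


print(cold_path_for_spills(10, 990, is_uniform=True))    # taken
print(cold_path_for_spills(10, 990, is_uniform=False))   # None
print(cold_path_for_spills(400, 600, is_uniform=True))   # None
```

The second call is the case the gating exists for: the counts look identical to a CPU-style cold path, but because the branch diverged at profile time, the spill placement is left alone.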

The results are promising so far, and it will be exciting to see how this PGO work pans out for AMD ROCm/HIP.
