It’s the software, dumbass
The year is coming to an end, and AMD had hoped that its powerful new MI300X AI chips would finally help it gain ground on Nvidia. But an extensive study by SemiAnalysis suggests that the company’s software challenges are allowing Nvidia to maintain its comfortable lead.
SemiAnalysis pitted AMD’s Instinct MI300X against Nvidia’s H100 and H200, with several differences observed between the chips. For the uninitiated, the MI300X is a GPU accelerator based on the AMD CDNA 3 architecture and is designed for high-performance computing, especially AI workloads.
On paper, the performance numbers seem excellent for AMD: the chip offers 1,307 TeraFLOPS of FP16 computing power and a whopping 192 GB of HBM3 memory, surpassing both of Nvidia’s competing offerings. AMD’s solutions also promise a lower total cost of ownership compared to Nvidia’s expensive chips and InfiniBand networks.
But as the SemiAnalysis crew discovered after five months of rigorous testing, raw specs aren’t the whole story. Despite the MI300X’s impressive silicon, AMD’s software ecosystem required significant effort to use effectively. SemiAnalysis relied heavily on AMD engineers to continually fix bugs and issues during their benchmarking and testing.
This is a far cry from Nvidia’s hardware and software, which the analysts noted tend to run smoothly out of the box without any hands-on help from Nvidia staff.
Furthermore, the software issues weren’t limited to SemiAnalysis’s testing; AMD’s customers also felt the pain. For example, Tensorwave, the largest AMD GPU cloud provider, had to give AMD engineers access to the same MI300X chips that it had purchased so that AMD could debug the software.
The problems don’t stop there. From integration issues with PyTorch to subpar scaling across multiple chips, AMD’s software has consistently lagged behind Nvidia’s proven CUDA ecosystem. SemiAnalysis also noted that many AMD AI libraries are essentially forks of Nvidia AI libraries, leading to suboptimal results and compatibility issues.
“The CUDA moat has yet to be crossed by AMD due to AMD’s weaker than expected software Quality Assurance (QA) culture and challenging out-of-the-box experience. As quickly as AMD tries to fill in the CUDA moat, Nvidia engineers are working overtime to deepen said moat with new features, libraries and performance updates,” reads an excerpt from the analysis.
The analysts did find a glimmer of hope in the pre-release BF16 development branches for the MI300X software, which showed much better performance. But by the time the code goes into production, Nvidia will likely have the next generation of Blackwell chips available (although Nvidia is reportedly having some growing pains with that rollout).
Taking these issues into account, SemiAnalysis has made a number of recommendations to AMD, starting with giving Team Red engineers more computing and engineering resources to repair and improve the ecosystem.
“Met with @LisaSu today. We went through everything for 1.5 hours. She acknowledged the holes in the AMD software stack. She took our specific recommendations seriously. She asked her team and us a lot of questions. Many changes are already coming! I’m glad to see improvements.” Dylan Patel (@dylan522p), December 23, 2024, https://t.co/38aAwwIdEI
SemiAnalysis founder Dylan Patel even met with AMD CEO Lisa Su. He posted to X that she understands how much work is needed to improve AMD’s software stack. He added that many changes are already in development.
However, it’s an uphill climb after years of seemingly neglecting this crucial component. As much as analysts want AMD to legitimately compete with Nvidia, the “CUDA moat” appears to be keeping Nvidia firmly in the lead for now.