Nvidia Corp.’s upcoming Blackwell graphics processing units, which have already been delayed once, are now reportedly encountering overheating issues that prevent their deployment in data center racks.
The report today by The Information says customers have raised serious concerns about the issue, worried that it will affect their plans for building out new data center infrastructure for artificial intelligence.
The problem is that the Blackwell GPUs appear to overheat when connected in data center server racks designed to hold up to 72 chips at once. The Information cited sources familiar with the issue as saying that when the chips are integrated into Nvidia’s customized server racks, they produce excessive heat that can result in operational inefficiencies or even hardware damage.
Nvidia has reportedly asked its suppliers to alter the design of the racks several times to try to resolve the overheating problems, but without success. The Information did not name the suppliers involved.
In response to the report, Nvidia downplayed the issue. “Nvidia is working with leading cloud service providers as an integral part of our engineering team and process,” a spokesperson for the company told Reuters today. “The engineering iterations are normal and expected.”
Nvidia first announced Blackwell in March as the successor to the hugely successful H100 GPUs that power the majority of the world’s AI applications today. The new chips are said to deliver up to a 30-times performance boost over the H100 while reducing energy consumption by as much as 25 times on some workloads.
The company had initially planned to ship the Blackwell chips in the second half of this year, but its plans came unstuck when a design flaw was revealed, causing the launch date to be pushed back to early 2025.
One of the key innovations in Blackwell is that it merges two silicon dies, each the size of the company’s H100 chip, into a single component. This is the crucial advance that allows the chip to process AI workloads much faster.
The original problem reportedly involved the processor die that connects those two dies, but Nvidia Chief Executive Jensen Huang said during a visit to Denmark last month that the issue had been resolved with assistance from its manufacturing partner, Taiwan Semiconductor Manufacturing Co.
It’s not immediately clear whether the new overheating issues will affect Blackwell’s revised launch date, slated for early next year, but Nvidia has every incentive to get the product just right. The GB200 Grace Blackwell superchips are set to cost up to $70,000 apiece, while a complete server rack is priced at more than $3 million.
Nvidia has previously said it hopes to sell around 60,000 to 70,000 complete servers, so any more delays could be extremely expensive for the company, which has become one of the most valuable publicly traded entities in the world due to its dominance of the AI industry. It’s set to report quarterly earnings results Wednesday.
For customers, the fear is that any delay would affect their data center infrastructure deployment plans and potentially hurt their ability to develop more advanced AI models and applications.
Image: Nvidia