Microsoft has unveiled its first AI super factory: a facility that connects large data centers in Wisconsin and Atlanta via a dedicated fiber network designed for high-speed transfer of training data.
Big tech companies are battling to dominate global AI, and building new infrastructure is a fundamental step given the technology's insatiable demands, whether for components such as accelerators, for high-performance networks, or for energy supply.
Microsoft’s first AI super factory
Microsoft explains that this infrastructure design will support large AI workloads, which differ from the smaller, more isolated tasks common in cloud environments. "It's about building a distributed network that can act as a virtual supercomputer to address the world's biggest challenges," explains Alistair Speirs, Microsoft general manager for Azure infrastructure.
"The reason we call it an AI super factory is that it runs a complex task on millions of devices… it's not just a single site training an AI model, but a network of sites supporting that task." The AI WAN system transports information over thousands of kilometers using dedicated fiber, some newly built and some repurposed from previous acquisitions.
Protocols and network architecture have been adjusted to shorten routes and keep data flowing with minimal delay. Microsoft says this allows remote sites to cooperate in the same model training process in near real time, with each location contributing its share of computing power. The goal is to maintain continuous activity across a large number of GPU accelerators so that no unit sits idle while waiting for results from another location.
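To see why shortening routes matters at this scale, a rough back-of-the-envelope sketch helps (the figures below are illustrative assumptions, not Microsoft's numbers): light in optical fiber travels at roughly 200,000 km/s, so propagation delay alone grows linearly with route length, before any switching or protocol overhead.

```python
# Sketch: one-way propagation delay over dedicated fiber.
# Assumed constant: signals travel ~200,000 km/s in fiber (~2/3 the speed
# of light in vacuum). Route lengths below are hypothetical examples.
FIBER_SPEED_KM_PER_S = 200_000

def one_way_delay_ms(route_km: float) -> float:
    """Propagation delay in milliseconds for a fiber route of the given length."""
    return route_km / FIBER_SPEED_KM_PER_S * 1000

# A hypothetical ~1,100 km direct route vs. a ~1,500 km indirect one:
direct = one_way_delay_ms(1100)    # 5.5 ms
indirect = one_way_delay_ms(1500)  # 7.5 ms
print(f"direct: {direct:.1f} ms, indirect: {indirect:.1f} ms")
```

A couple of milliseconds saved per hop may look small, but a distributed training step can involve many round trips, so route optimization compounds across the whole synchronization cycle.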
"Leading in AI is not just about adding more GPUs, but about building the infrastructure that makes them work together as a single system," explains Scott Guthrie, executive vice president of Cloud + AI at Microsoft. To do this, the company uses the Fairwater layout to support high-performance rack systems, including NVIDIA GB200 NVL72 units designed to scale to very large clusters of Blackwell GPUs.
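For a sense of the rack math, a minimal sketch (the 72-GPU figure comes from NVIDIA's NVL72 naming; the target cluster size is a made-up example, not a figure from Microsoft):

```python
import math

# NVIDIA GB200 NVL72: 72 Blackwell GPUs interconnected in a single rack.
GPUS_PER_NVL72_RACK = 72

def racks_needed(target_gpus: int) -> int:
    """Number of NVL72 racks required to reach a target Blackwell GPU count."""
    return math.ceil(target_gpus / GPUS_PER_NVL72_RACK)

# Hypothetical cluster of 100,000 GPUs:
print(racks_needed(100_000))  # 1389 racks
```

The point of the rack-scale design is that those 72 GPUs share one high-bandwidth interconnect domain, so the network between racks (and, here, between sites) becomes the binding constraint.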
The company combines this hardware with liquid cooling systems that send hot fluid out of the building and return it at lower temperatures. The cooling system uses practically no new water, except for the periodic replenishment necessary for chemical control, which addresses another serious problem of modern data centers: water consumption.
The company presents this AI super factory as a site designed specifically for training advanced AI models, citing the growing number of parameters and ever-larger training data sets as the key pressures driving the expansion.
The Atlanta plant replicates the Wisconsin design, providing a consistent architecture across multiple regions as more facilities come online. And many more will be necessary: "The amount of infrastructure needed now to train these models is not just one data center, not two, but many more," the company asserts.
