From Wisconsin to Atlanta: Microsoft connects datacenters to build its first AI superfactory

In Atlanta, Microsoft has flipped the switch on a new class of datacenter – one that doesn’t stand alone but joins a dedicated network of sites functioning as an AI superfactory to accelerate AI breakthroughs and train new models on a scale that has previously been impossible.

The new AI datacenter in Atlanta, which began operation in October, is the second in Microsoft’s Fairwater family. It shares the same architecture and design as the company’s recently announced Fairwater site in Wisconsin. But these aren’t simply isolated buildings densely packed with sophisticated silicon and cooled with techniques that use almost zero water.

These Fairwater AI datacenters are directly connected to each other – and eventually to others under construction throughout the US – with a new type of dedicated network allowing data to flow between them extremely quickly. This enables Fairwater sites located in different states to work together as an AI superfactory to train new generations of AI models far more quickly, accomplishing jobs in just weeks instead of several months.

The network will connect multiple sites with hundreds of thousands of the most advanced GPUs running AI workloads, exabytes of storage and millions of CPU cores for operational compute tasks, which all work together to support OpenAI, the Microsoft AI Superintelligence Team, Copilot capabilities and other leading AI workloads.

“This is about building a distributed network that can act as a virtual supercomputer for tackling the world’s biggest challenges in ways that you just could not do in a single facility,” said Alistair Speirs, Microsoft general manager focusing on Azure infrastructure.

 “A traditional datacenter is designed to run millions of separate applications for multiple customers,” he added. “The reason we call this an AI superfactory is it’s running one complex job across millions of pieces of hardware. And it’s not just a single site training an AI model, it’s a network of sites supporting that one job.”

Information is shared among all GPUs at the Atlanta Fairwater site with very fast, high-throughput networking. Photo courtesy of Microsoft.

Microsoft’s new Fairwater AI datacenters have a unique design that differentiates them from previous generations. Atlanta, for instance, features:

  • a new chip and rack architecture that delivers the highest throughput per rack of any cloud platform available today
  • NVIDIA GB200 NVL72 rack-scale systems that can scale to hundreds of thousands of NVIDIA Blackwell GPUs
  • a two-story design that allows for greater GPU density
  • advanced liquid cooling that consumes almost zero water in its operations
  • intelligent networking, enabling fast communication among GPUs
  • a new dedicated network linking it to AI compute clusters at other sites

The design for Microsoft’s Fairwater sites has been informed by years of building ever more powerful, capable and efficient AI infrastructure. From the first supercomputer Microsoft developed in collaboration with OpenAI for large-scale training of AI models in 2019 to systems that trained subsequent OpenAI models, Microsoft has learned and improved upon each design. That includes refining, inventing or rethinking every layer of the infrastructure stack.

“Leading in AI isn’t just about adding more GPUs – it’s about building the infrastructure that makes them work together as one system,” said Scott Guthrie, Microsoft executive vice president of Cloud + AI.

“We’ve spent years advancing the architecture, software and networking needed to train the largest models reliably, so our customers can innovate with confidence. Fairwater reflects that end-to-end engineering and is designed to meet growing demand with real-world performance, not just theoretical capacity,” he said.

This new type of AI datacenter is designed to accelerate large-scale AI workloads and integrate with Microsoft’s global infrastructure fleet to provide fungibility across the AI lifecycle, from training frontier models to inferencing – the use of those AI capabilities by customers around the world.

The physical density of GPUs at Fairwater sites allows Microsoft to pack more compute power into a smaller footprint to reduce latency, or lag. Photo courtesy of Microsoft.

Purpose-built for AI

Microsoft’s Fairwater datacenters are built from the ground up to excel at one task: training and running new AI models. The exponential growth in the number of parameters – which determine how an AI model processes data and delivers answers – plus vastly greater amounts of training data require correspondingly greater computing resources.
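
As a rough sense of scale – the figures below are hypothetical assumptions, not Microsoft numbers – a widely cited approximation puts training compute at about six floating-point operations per model parameter per training token, so growth in either dimension multiplies the hardware required:

```python
# Back-of-the-envelope training-compute estimate using the common
# "FLOPs ≈ 6 × parameters × tokens" approximation.
# All figures below are illustrative assumptions, not Microsoft data.
params = 1e12                 # hypothetical 1-trillion-parameter model
tokens = 10e12                # hypothetical 10-trillion-token training run
total_flops = 6 * params * tokens        # ≈ 6e25 floating-point operations

gpu_flops_per_sec = 1e15      # assume ~1 PFLOP/s sustained per GPU
gpus = 100_000                # assume a fleet of 100,000 GPUs
seconds = total_flops / (gpu_flops_per_sec * gpus)
print(f"~{seconds / 86_400:.0f} days on {gpus:,} GPUs")  # roughly a week
```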

Companies are developing increasingly sophisticated AI models that now power billions of chats a day, make workdays more efficient and help people make sense of vast amounts of business intelligence. They’re also improving the capacity to predict storms more accurately, spot patterns that improve medical treatment and invent new materials to solve energy challenges.

“To make improvements in the capabilities of the AI, you need to have larger and larger infrastructure to train it,” said Mark Russinovich, CTO, deputy CISO, and technical fellow, Microsoft Azure. “The amount of infrastructure required now to train these models is not just one datacenter, not two, but multiples of that.”

The distributed networking of Microsoft’s Fairwater sites is designed to enable them to support training models with hundreds of trillions of parameters. Moreover, that AI training is no longer a single, monolithic job. It now spans pre-training, fine-tuning, reinforcement learning, evaluation and synthetic data generation, each with unique requirements. 

The novel level of connection starts inside each Fairwater datacenter, where hundreds of thousands of NVIDIA Blackwell graphics processing units, or GPUs – the kind of chip most used for AI – are interconnected. Each of those chips can talk to others and share memory within a specially designed 72-GPU server rack, and information is shared among all GPUs at the site with very fast, high-throughput networking.

Not only are all the chips at one site interconnected, but they are physically very close, both on the racks and within the building. Unlike many datacenters, the Fairwater design uses two stories. This allows Microsoft to pack more compute power into a smaller footprint to reduce latency, or lag. But it also brought new challenges in datacenter design that had to be solved, such as how to route the cables and coolant pipes and how to support the weight of the second story.

The Fairwater AI datacenter design has two stories. Photo courtesy of Microsoft.

Connecting sites to work as an AI superfactory

As Microsoft brings more AI datacenters online, they will also be connected to each other with an AI Wide Area Network, or AI WAN, via dedicated fiber-optic cables. That allows the data to travel, congestion-free, at nearly the speed of light.
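
To get a feel for what that means in practice – the route length and fiber properties below are illustrative assumptions, not Microsoft figures – light in optical fiber travels at roughly two-thirds of its vacuum speed, so even a path well over a thousand kilometers long adds only single-digit milliseconds of one-way propagation delay:

```python
# Rough, illustrative estimate of one-way propagation delay over fiber.
# The route length and refractive index are assumptions, not Microsoft data.
c_km_per_s = 299_792                  # speed of light in vacuum
refractive_index = 1.47               # typical for optical fiber
fiber_km_per_s = c_km_per_s / refractive_index   # ≈ 204,000 km/s

route_km = 1_600                      # assume a route on the order of 1,000 miles
one_way_ms = route_km / fiber_km_per_s * 1_000
print(f"~{one_way_ms:.1f} ms one way")  # ≈ 8 ms, before any switching overhead
```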

Some of the fiber-optic cables for the dedicated AI network have been built new; others were acquired by Microsoft years ago and repurposed. The company has deployed 120,000 miles of dedicated fiber for the network, increasing its overall mileage by more than 25% in one year.

Meanwhile, the network protocol – the software that directs the flow of data – and the network architecture have been fine-tuned to make connections as direct as possible.

The AI WAN connects the chips and racks in one location to similar infrastructure many states away, allowing data to flow with minimal traffic bottlenecks that would otherwise create latency. This lets multiple sites cooperate more efficiently on near real-time model training.

Microsoft’s networking innovations optimize for low-latency connections both inside each site and across the network, with fiber that is fit for purpose.

Those fast connections are essential to training AI models, which requires a different approach than cloud datacenters that run many smaller, independent workloads such as hosting websites, email or business applications. Instead, Fairwater sites need to function as one, with hundreds of thousands of the latest NVIDIA GPUs working together as an AI superfactory on a massive compute job.

Each GPU sees a slice of the training data and does its computation, but it also needs to share the results of its computation with all the others. And then they all need to update the AI model at the same time. That means if any part of the process is bottlenecked or slow, it holds up the entire job as the rest of the GPUs sit idle, waiting for others to finish, Russinovich said. The goal of the Fairwater network is to keep GPUs busy at all times.
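
A minimal sketch of that synchronization pattern is below, using PyTorch’s collective-communication primitives as a stand-in; the model, optimizer and process setup are assumed, and this is not a description of Microsoft’s actual training stack:

```python
# Sketch of one data-parallel training step: compute locally, then
# synchronize gradients across every GPU before anyone updates the model.
import torch.distributed as dist

def training_step(model, optimizer, local_batch):
    # 1. Each GPU computes on its own slice of the training data.
    loss = model(local_batch).mean()
    loss.backward()

    # 2. Every GPU shares its gradients with all the others. all_reduce
    #    blocks until every participant has contributed, so a single slow
    #    or bottlenecked GPU stalls the whole step.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    # 3. All GPUs apply the same averaged update, keeping the model
    #    identical everywhere before the next step begins.
    optimizer.step()
    optimizer.zero_grad()
```

In practice, training frameworks overlap this communication with computation, but the constraint described above still holds: no step finishes until the slowest participant has delivered its results.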

An infographic highlights six special features of the Fairwater design:

  • a high-speed AI backbone that bridges thousands of miles in milliseconds
  • multi-gigawatt campuses that maximize tokens per watt
  • hundreds of thousands of GPUs per region in unmatched density
  • innovative closed-loop cooling that recirculates coolant without consuming extra water
  • shorter cables connecting thousands of racks so AI moves at the speed of light
  • AI app-aware networking that directs traffic for maximum GPU performance

If exchanging information quickly is essential, why build a second training site so far from the first one in Wisconsin? Because land and power availability make it more attractive – and at this point necessary – to spread the work across different physical locations.

“You really need to make it so that you can train across multiple regions, and nobody’s really run into that problem yet because they haven’t gotten to the scale we’re at now,” Russinovich said. 

Making distant sites work as one required new networking technologies and entirely new infrastructure dedicated to the task – similar in principle to the carpool lane on a congested highway.

“The future of AI will be shaped by connecting datacenters into a unified, distributed system. By making our AI sites operate as one, we’re able to help our customers bring breakthrough models to life, deliver results that matter in the real world and empower them to solve challenges and create new opportunities,” Guthrie said.

Cutting-edge cooling

Inside a Fairwater datacenter, the density of the GPUs poses another challenge: heat. AI chips generate more heat than traditional chips, so Microsoft engineered a complex closed-loop cooling system for its Fairwater sites that carries hot liquid out of the building to be chilled and then returns it to the GPUs. This required a new configuration of pipes, pumps and chillers to cool such a large site. The water used in Fairwater Atlanta’s initial fill is equivalent to what 20 homes consume in a year and is replaced only if water chemistry indicates it is needed.

Every aspect of the Fairwater AI datacenters and their networking innovations has been optimized to deliver AI computing power with the greatest efficiency and the fewest resources. That continual innovation has been essential to keep meeting demand, Speirs said.

“We’ve got customers and infrastructure that are already training AI models at such a large scale,” he said. “We’ve really been the ones with the hard hats on, running through the hardest problems and breaking through brick walls.”

Top image: An aerial view of the Fairwater AI datacenter in Atlanta.

Related links:

Read more: Infinite scale: the architecture behind Azure’s AI superfactory

Read more: Inside the world’s most powerful AI datacenter

Read more: Made in Wisconsin: The world’s most powerful AI datacenter