AI Hardware Landlords
Nvidia enlists colo providers around the world to house (and cool) its customers’ supercomputers.
One does not simply buy a supercomputer.
Even if you already have the data center space and the budget for a supercomputer to train deep learning models, not every facility is designed to support the kind of AI hardware that’s on the market today.
To help companies that want to use its DGX supercomputers for AI solve the data center puzzle, Nvidia earlier this year launched a referral program that matches them with colocation providers whose facilities are ready to power and cool these GPU-stuffed beasts. The program started with 10 providers, all US-based. In July, Nvidia expanded it internationally, and data center companies are now champing at the bit to host Nvidia’s AI hardware on customers’ behalf in 24 countries.
The reason a program like this is necessary is the unusually large amount of power that systems like Nvidia’s (and other AI hardware vendors’) draw per square foot of data center floor. In a typical enterprise data center supporting typical enterprise workloads running on typical x86 servers, power densities per rack are in the 3kW to 5kW range (give or take). DGX-1 and DGX-2 both use about 1kW per rack unit, one data center operator that hosts these systems told DCK earlier this year. A single DGX-1 takes up three rack units, while a single DGX-2 takes up 10, and the operator, Colovore, commonly deals with densities north of 30kW per rack for customers that use the hardware.
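For a rough sense of how those numbers add up, here is a minimal back-of-the-envelope sketch in Python. The ~1kW-per-rack-unit figure, the 3U and 10U chassis sizes, and the 3kW-to-5kW enterprise baseline are the approximate numbers quoted above, treated here as assumptions rather than official specifications.

```python
# Back-of-the-envelope rack power density estimate.
# The ~1 kW per rack unit figure and the 3U / 10U chassis sizes are the
# rough numbers quoted in the article; treat them as approximations.

KW_PER_RACK_UNIT = 1.0            # approximate draw per rack unit for DGX-1/DGX-2
RACK_UNITS = {"DGX-1": 3, "DGX-2": 10}
TYPICAL_ENTERPRISE_RACK_KW = 5.0  # upper end of the 3-5 kW range cited above


def rack_power_kw(systems: dict[str, int]) -> float:
    """Estimate total draw (kW) for a rack holding the given system counts."""
    return sum(RACK_UNITS[name] * count * KW_PER_RACK_UNIT
               for name, count in systems.items())


if __name__ == "__main__":
    # Example: three DGX-2 systems in one rack (30U of a standard 42U rack)
    rack = {"DGX-2": 3}
    kw = rack_power_kw(rack)
    print(f"Estimated rack draw: {kw:.0f} kW")
    print(f"Roughly {kw / TYPICAL_ENTERPRISE_RACK_KW:.0f}x a typical enterprise rack budget")
```

Under those assumptions, three DGX-2s in a single rack already land around 30kW, which is consistent with the densities Colovore says it routinely supports.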
Customers’ ability to power and cool the machines hasn’t been a sales roadblock for Nvidia, but it has been a “conversation point” ever since the company rolled the first-generation product out about three years ago, Charlie Boyle, senior director of DGX systems at the company, told Data Center Knowledge in an interview. When DGX was new to the market, and customers were buying just a few systems at a time, supporting them in their existing data centers wasn’t an issue. But when some of them decided to scale their AI infrastructure, they’d have to choose between rearchitecting their data centers and going to a colo provider that had the right infrastructure at the ready.
Boyle said he’d seen customers choose either option, but even at companies with “great data center facilities, lots of times the business side would call up the data center and say, ‘Oh, I need four 30kW racks.’ The data center team would say, ‘Great, we can do that, but that’s six months.’ Whereas we can go to one of our colo partners, and they can get it next week.”
AI, which has skyrocketed over the last few years, is an entirely new class of computing workload. As more and more companies learn to train and deploy their own AI models and make these workloads part of the normal course of business, the trend is likely to become a major source of growth for colocation providers. Which explains why they’ve been keen to join a referral program like Nvidia’s.
Colo providers with data centers that are physically close to enterprise data centers have an advantage in capturing the AI hardware business. As Boyle explained to us, it is best to keep computing infrastructure for AI right next to where the data used to train the models is stored.
“In an AI context, you need a massive amount of data to get a result,” he said. “If all that data is already in your enterprise data center, and you’ve got 10, 20, 30 years of historical data, you want to move the processing as close to that as possible.” Nvidia, for example, uses colocation data centers around its headquarters in Santa Clara, California, to house its own AI hardware.
This “data gravity” is also the reason companies that already have most or all of their data in a cloud are likely to use their cloud provider’s AI services instead of buying their own hardware.
Nvidia’s list of “DGX-Ready Data Center” providers includes the biggest players, such as Equinix, Digital Realty Trust, Cyxtera, Interxion, and NTT Communications, but also lesser-known high-density colocation specialists, such as Colovore, Core Scientific, and ScaleMatrix. In total, 12 providers participate in the program in the Americas, four in Europe, and seven in Asia, with Equinix and Cyxtera participating across all three regions.
Not every data center in the participating companies’ portfolios can support high-density DGX deployments. Only five of Flexential’s 41 data centers in the US, for example, could do it when we asked the company earlier this year, after Nvidia first announced the program.
There are multiple ways to cool high-density compute. To date, liquid-cooled rear-door heat exchangers have been the go-to technology for most customers, Boyle said. The approach doesn’t require radical design changes to data center infrastructure or any changes to the hardware.
Flexential has managed to cool 35kW racks of DGXs using traditional raised-floor air cooling, the company’s chief innovation officer Jason Carolan told us. However, it isolates air-intake aisles with doors on either side, “creating a bathtub of cold air.” Speaking with DCK earlier this year, Carolan said the company had yet to see a customer deployment large enough to require liquid cooling but was ready to go there if needed.
Asked whether he expects DGX or other GPU-powered AI hardware to reach densities that rule out rear-door heat exchangers and require direct-to-chip liquid cooling, Boyle said not in the foreseeable future. There are already OEM-designed systems using Nvidia GPUs that rely on direct-to-chip liquid cooling. Nvidia’s own workstation version of the DGX, called DGX Station, is also liquid-cooled, he pointed out. At least today, the decision to go direct-liquid is usually driven by individual form-factor needs and data center capabilities, not chip or hardware design.
Ultimately, “there’s nothing inherent in what we need to do in density that says we can’t do air or mixed-air for the foreseeable future, mainly because most people would be limited by how much physical power they can put in a rack,” Boyle said. Instead of 30kW and 40kW racks that are common today, you could theoretically have 100kW and 200kW racks, “but nobody has that density today.”
This article originally appeared in the AFCOM Journal, available exclusively to AFCOM members.