ScaleMatrix and Nvidia's New AI and HPC Appliance Doesn't Need a Data Center
AI Anywhere models deliver 8 and 13 petaFLOPS inside self-sufficient racks, thanks to proprietary cooling tech.
November 19, 2019
Just two months ago we wrote, "One does not simply buy a supercomputer." But in the rapidly changing world of IT, what was true in September may no longer be true in November.
Today at SC19, the supercomputing conference being held in Denver, data center provider ScaleMatrix introduced appliances that can deliver up to 13 petaFLOPS of performance. With cooling built in, they're plug-and-play out of the box and don't need to be housed in a specially designed data center.
This could be a game changer. Most high performance computing systems required for machine learning and other AI workloads can't be located in a typical data center without major modifications to the facility's power distribution and cooling systems. Because they're GPU-intensive, HPC systems can push density up to about 30kW per rack, six to ten times the average data center load of 3kW to 5kW.
But ScaleMatrix's new appliance is self-sufficient.
"All we need is a roof, floor space, and a place to plug the appliance in, and we can turn on an enterprise-class data center capable of supporting a significant artificial intelligence or high performance computing workload," Chris Orlando, ScaleMatrix's co-founder and CEO, told DCK.
Called "AI Anywhere," the product was developed in a three-way collaboration between ScaleMatrix, which operates high-density colocation data centers for AI workloads in Houston and San Diego, chipmaker Nvidia, and Microway, a provider of computer clusters, servers, and workstations for HPC and AI. They're available in two single-rack versions, each employing one of Nvidia’s two DGX supercomputer models, designed specifically for machine learning and AI workloads.
One model contains 13 DGX-1 units delivering 13 petaFLOPS; the other contains four DGX-2 systems delivering 8 petaFLOPS. Both adhere to Nvidia's DGX POD reference architecture for building GPU-accelerated AI infrastructure and include the full Nvidia DGX software stack, deep learning and AI framework containers, NetApp ONTAP storage, and Mellanox switching.
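Those totals follow directly from Nvidia's published per-system ratings (roughly one petaFLOPS of deep learning performance for a DGX-1 and two petaFLOPS for a DGX-2). A back-of-envelope sketch in Python, with the caveat that the per-unit figures are Nvidia spec-sheet ratings, not measurements from these appliances:

    # Per-system deep learning performance, per Nvidia's spec sheets (petaFLOPS).
    DGX1_PFLOPS = 1.0   # DGX-1 (eight Tesla V100 GPUs)
    DGX2_PFLOPS = 2.0   # DGX-2 (sixteen Tesla V100 GPUs)

    # The two AI Anywhere rack configurations described above.
    print(13 * DGX1_PFLOPS)   # 13 DGX-1 units  -> 13.0 petaFLOPS
    print(4 * DGX2_PFLOPS)    # 4 DGX-2 systems -> 8.0 petaFLOPS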
"Any enterprise that wants to be a supercomputing enterprise could have never imagined deploying the scale of infrastructure that they can support now with this solution," Tony Paikeday, director of product marketing of AI and deep learning at Nvidia, told us. "Prior to this they would have needed an AI-ready data center, the kind of facility that is optimized for the power and cooling demand of these accelerated computing systems. Now you can literally drop a supercomputing facility in places that would have been unimaginable before."
The secret sauce that makes these plug-and-play supercomputers possible is in the cooling.
With the DGX-1 version consuming 42kW and the DGX-2 version running at 43kW, a single rack generates more heat than most well-equipped data centers can handle. The appliances use ScaleMatrix's proprietary closed-loop, chilled water-assisted, forced-air cooling system -- the same design that cools ScaleMatrix's data centers -- with chilled water circulating through racks designed and built by DDC, a ScaleMatrix subsidiary.
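For a sense of what that heat load means, standard unit conversions translate a 43kW rack into cooling terms. A rough sketch (the conversion factors are generic engineering constants, not ScaleMatrix's figures):

    # Rough cooling load for one AI Anywhere rack.
    # Standard conversions: 1 kW ~= 3,412 BTU/hr; 1 ton of cooling = 12,000 BTU/hr.
    rack_kw = 43.0                    # DGX-2 version; the DGX-1 version draws 42kW
    btu_per_hr = rack_kw * 3412
    tons = btu_per_hr / 12_000
    print(f"{btu_per_hr:,.0f} BTU/hr, about {tons:.1f} tons of cooling")
    # -> 146,716 BTU/hr, about 12.2 tons, all from a single cabinet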
A separate "micro-chiller" unit sits next to the rack, cooling water for the AI Anywhere systems.
This hybrid air-and-water approach not only cools efficiently -- in-rack sensors direct cool air where it's needed -- but also avoids the risk associated with bringing liquid directly to silicon, Orlando said.
"Where the water comes in and out and does the thermal exchange, that area is sealed off from the rest of the cabinet," he said, "so we're bringing all the efficiency of water cooling to the cabinet without introducing any of the risk."
And the design can cool much higher densities than those of the two DGX-based solutions, he said. DDC recently introduced a rack that can handle up to 85kW using the same cooling system.
According to Paikeday, AI Anywhere addresses a need that Nvidia has been observing for a while.
"Customers are deploying larger and larger infrastructures to either tangle with more complex AI problems like natural language processing, or they're doing a consolidation play of trying to take stranded AI platform investments, kind of like 'Shadow AI,' that are sprawling across their enterprise and bring them under one roof," he said. "The question that inevitably comes back from most of these customers is, I'd love to do this but I'd have to have a data center and I'm getting out of the data center business -- I'm not trying to put more CapEx back into my data center.
"So this is now a perfect way to remove that last-mile barrier of how to get this kind of computing power into their hands."
The devices will be marketed only as AI Anywhere and won't carry any partner's logo as a master brand, although the individual components used in the appliance will be branded.
"The cabinets will be branded DDC. Microway is the delivery and services partner, and Nvidia, NetApp, and Mellanox infrastructure will each have their own logos. The ScaleMatrix cabinet exterior will be marked with AI Anywhere," ScaleMatrix said in response to our query.