AWS Launches Trainium2 Custom AI Chip, Data Center Upgrades
The cloud provider announced the arrival of Trainium2-powered cloud services, the forthcoming Trainium3, and new data center improvements.
Amazon Web Services (AWS) said today (Tuesday, Dec. 3) that its latest custom AI chip, Trainium2, is now available through two new cloud services for training and deploying large AI models.
At its AWS re:Invent conference in Las Vegas, AWS said its new Amazon Elastic Compute Cloud (EC2) Trn2 instances, each featuring 16 Trainium2 chips, provide 20.8 peak petaflops of compute, making them ideal for training and deploying large language models (LLMs) with billions of parameters.
AWS also introduced a new EC2 offering, EC2 Trn2 UltraServers, which feature 64 interconnected Trainium2 chips and scale up to 83.2 peak petaflops of compute. That makes it possible to train and deploy the world’s largest AI models, the company said.
The hyperscale cloud provider is also collaborating with Anthropic, the creator of the Claude LLM, to build an EC2 cluster of Trn2 UltraServers that will contain hundreds of thousands of Trainium2 chips, on which Anthropic will build and deploy its future models. The effort, called Project Rainier, will provide Anthropic five times more exaflops than it used to train its current AI models, AWS said.
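The two peak figures are consistent with each other: dividing each by its chip count gives the same per-chip value, and the UltraServer figure is four times the instance figure. A quick check in Python, using only the numbers AWS quoted (the variable and function names are illustrative, and the per-chip value is derived here rather than stated by AWS):

```python
# Peak compute figures quoted by AWS (petaflops) and the chip counts.
trn2_instance = {"chips": 16, "peak_pflops": 20.8}
trn2_ultraserver = {"chips": 64, "peak_pflops": 83.2}


def per_chip_pflops(system):
    """Derived peak petaflops per chip (an inference, not an AWS-stated spec)."""
    return system["peak_pflops"] / system["chips"]


instance_per_chip = per_chip_pflops(trn2_instance)
ultraserver_per_chip = per_chip_pflops(trn2_ultraserver)

# How much larger the UltraServer peak is than a single Trn2 instance.
scale_factor = trn2_ultraserver["peak_pflops"] / trn2_instance["peak_pflops"]

print(f"Trn2 instance:    {instance_per_chip:.2f} peak petaflops per chip")
print(f"Trn2 UltraServer: {ultraserver_per_chip:.2f} peak petaflops per chip")
print(f"UltraServer vs. instance: {scale_factor:.1f}x")
```

Both systems work out to about 1.3 peak petaflops per chip, so the quoted UltraServer figure reflects four times as many chips rather than a faster chip. Peak numbers say nothing about sustained throughput on a real training job.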
Peter DeSantis, SVP of AWS Utility Computing, during his re:Invent keynote address. The new EC2 Trn2 UltraServers are pictured behind him. Credit: AWS
AWS today also announced plans for its next-generation AI chip, the Trainium3, which is expected to be twice as performant and 40% more energy efficient than the Trainium2, said Gadi Hutt, senior director of product and customer engineering at AWS’ Annapurna Labs. The 3-nanometer Trainium3 will be available in late 2025.
Analysts Give Their Take on the Trainium2
With its custom AI chip announcements today, AWS beefs up its AI offerings and provides a new low-cost alternative to Nvidia’s GPUs. Analysts said AWS has the potential to attract customers to its new Trainium2 services as enterprises increasingly adopt AI.
Gartner analyst Jim Hare said some AI workloads can run on CPUs, while many require GPUs from the likes of Nvidia, which AWS supports. But Trainium2 – which provides better performance and is more energy efficient than AWS’ first-generation Trainium chip – gives AWS customers another option because of its price-performance benefits, he said.
AWS, which announced plans to build Trainium2 one year ago, said its new Trainium2-powered EC2 Trn2 instances provide 30% to 40% better price performance than the current generation of GPU-based EC2 instances.
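AWS does not define "price performance" in the announcement. Assuming it means work done per dollar (an interpretation, not an AWS statement), a 30% to 40% gain translates into a smaller percentage cut in the cost of a fixed workload. A minimal sketch, with a hypothetical helper name:

```python
def relative_cost(price_perf_gain):
    """Cost of a fixed workload relative to the baseline.

    Assumes price performance = work per dollar, so a gain of g
    means the same work costs 1 / (1 + g) of the baseline price.
    """
    return 1.0 / (1.0 + price_perf_gain)


# The low and high ends of the range AWS quoted.
savings = {gain: 1.0 - relative_cost(gain) for gain in (0.30, 0.40)}

for gain, saving in savings.items():
    print(f"{gain:.0%} better price performance -> about {saving:.1%} lower cost")
```

Under that assumption, the quoted range corresponds to roughly 23% to 29% lower cost for the same amount of work, not 30% to 40% lower.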
“Customers naturally think they would go to a GPU for anything AI, but as customers move from experimenting with AI, where they think, ‘This is great. Look what I can do with AI’ to ‘How do I deploy this at scale, and do it in a much more cost-effective way,’ more customers will be open to looking at alternatives,” Hare told Data Center Knowledge.
“Trainium2 is going to give better price performance,” Hare added. “I think that’s going to be the catalyst that causes customers to look at Trainium2 as an alternative, especially when they’re price sensitive.”
Analyst Matt Kimball of Moor Insights & Strategy said the 20.8 petaflops of peak performance delivered by the Trn2 instances puts them in a competitive position with Nvidia and AMD GPUs. And Trn2 UltraServers’ ability to deliver more than 80 petaflops of peak performance makes them a good option for large model training, he said.
For some enterprise organizations, AWS’ project with Anthropic will validate Trainium2 as a viable alternative for AI training, Kimball said. Some enterprises that previously disregarded AWS’ in-house AI chip because it wasn’t from Nvidia may give it a closer look, he said.
“As silly as this may sound, many enterprise organizations are more conservative in their adoption of new technologies, so great chips like Trainium get overlooked because they are not from the company that has been dubbed, ‘the godfather of AI’ for the last year,” Kimball said. “This partnership tells those IT organizations that not only is Trainium – as a brand, and Trainium2 as a chip – legitimate, it’s supporting some of the most demanding AI needs in the industry as Anthropic chases OpenAI.”
Competitive Landscape in the Cloud and AWS’ Chip Strategy
AWS and its cloud competitors Google Cloud and Microsoft Azure all partner with large chipmakers Nvidia, AMD and Intel – and provide services powered by their processors. But the three cloud giants also find it advantageous and cost-effective to build their own custom chips.
All three cloud providers, for example, have built their own in-house CPUs for general workloads and in-house AI accelerators for AI training and inferencing services.
AWS’ chip strategy is to give customers many choices, Hutt said in an interview. AWS launched its first-generation Trainium chip for AI training in 2022 and made Inferentia2, its second-generation AI inferencing chip, available in 2023.
In addition to offering the new Trainium2-powered EC2 services, the company also offers multiple EC2 instances that support Nvidia GPUs and one EC2 instance that supports an Intel Gaudi accelerator.
The upshot: Trainium2 customers will enjoy high performance and the lowest cost for their workloads, Hutt said. Trainium2 is designed to support training and deployment of frontier LLM, multimodal and computer vision models, he added.
“We are all about giving customers choice,” Hutt said. “Customers that have workloads that fit GPUs might choose GPUs. Customers that want to have the best price performance from their chips choose Trainium/Inferentia.”
For example, with Trainium2, Anthropic’s Claude Haiku 3.5 LLM gets a 60% boost in speed compared to other chip alternatives, he said.
AWS Announces New Data Center Infrastructure Innovations
At re:Invent on Monday, AWS also announced new data center infrastructure improvements in power, cooling and hardware design that will better support AI workloads and improve resiliency and energy efficiency.
AWS said the improvements include a more efficient cooling system that adds liquid cooling and reduces the number of fans, which will result in a 46% reduction in mechanical energy consumption. AWS also said backup generators will be able to run on renewable diesel, which will cut down on greenhouse gas emissions.
To support high-density AI workloads, AWS said it has developed engineering innovations that enable it to support a sixfold increase in rack power density over the next two years. That is made possible, in part, by a new power shelf that efficiently delivers data center power throughout a rack, according to AWS.
New AI servers will also benefit from liquid cooling to more efficiently cool high-density chips such as Trainium2 and AI supercomputing solutions like Nvidia GB200 NVL72, the company said.
“We have used only a very small amount (of liquid cooling in the past),” Kevin Miller, AWS’ vice president of global data centers, told Data Center Knowledge. “But we are now at the stage where we’re beginning to rapidly increase the amount of liquid cooling capacity we are deploying.”
AWS has also improved automation in its control systems to improve resiliency. The control systems, software that monitors components within each data center, can more quickly troubleshoot problems to prevent downtime or other issues, he said.
“In some cases, manual troubleshooting efforts that would have taken hours (in the past) now happens within two seconds because our software is automatically looking at all the sensors, making decisions and then taking corrective action,” Miller said.
Miller said AWS has already installed these innovations, which AWS calls “data center components,” in some AWS data centers. AWS will continue to install them in new and existing data centers moving forward, he said.
IDC analyst Vladimir Kroa said AWS’ data center improvements are significant because they enable resiliency and improved operational and energy efficiency.
“What is powerful is not any one single component. To make a real impact, it’s a combination of all of them,” Kroa said.