AWS launches Trainium2 instances, teases next-gen custom silicon

AWS Trainium2 custom silicon

Amazon Web Services announced the general availability of instances powered by its custom Trainium2 chips and teased its next-generation custom silicon: Trainium3.

Showcased at AWS re:Invent 2024, the new Trainium2-powered instances offer 30 to 40 percent better price-performance than the hyperscaler’s GPU-based EC2 P5e and P5en instances.

Each instance features 16 Trainium2 chips and delivers up to 20.8 peak petaflops of compute, enough to power training workloads for large language models.

“With models approaching trillions of parameters, we understand customers also need a novel approach to train and run these massive workloads,” said David Brown, VP of compute and networking at AWS. “Trainium2 is purpose-built to support the largest, most cutting-edge generative AI workloads, for both training and inference and to deliver the best price-performance on AWS.”

Alongside individual instances, AWS showcased the new EC2 Trn2 UltraServers — which feature 64 interconnected Trainium2 chips, offering up to 83.2 peak petaflops of compute.

The UltraServers quadruple the compute, memory, and networking of a single instance, with AWS suggesting they can handle some of the world’s largest AI models.
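The quoted figures scale consistently; a quick sanity check, assuming the per-chip figure is simply the instance total divided by the chip count:

```python
# Sanity check: AWS's quoted peak-petaflops figures scale linearly with chip count.
TRN2_INSTANCE_CHIPS = 16
TRN2_INSTANCE_PFLOPS = 20.8

# Implied per-chip peak: 20.8 / 16 = 1.3 petaflops per Trainium2 chip.
per_chip_pflops = TRN2_INSTANCE_PFLOPS / TRN2_INSTANCE_CHIPS

# An UltraServer interconnects 64 chips: 64 x 1.3 = 83.2 peak petaflops,
# exactly four times the single-instance figure, matching AWS's claim.
ULTRASERVER_CHIPS = 64
ultraserver_pflops = per_chip_pflops * ULTRASERVER_CHIPS

print(per_chip_pflops, ultraserver_pflops)  # 1.3 83.2
```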

AWS EC2 Trn2 UltraServer tray

“New Trn2 UltraServers offer the fastest training and inference performance on AWS and help organisations of all sizes to train and deploy the world’s largest models faster and at a lower cost,” Brown added.

Alongside Claude developer Anthropic, AWS is building an “UltraCluster” made up of the Trn2 UltraServers, named Project Rainier.

Featuring hundreds of thousands of interconnected Trainium2 chips, Project Rainier will act as an AI training powerhouse, with Anthropic using it to train its flagship AI models, including the new Claude 3.5 Sonnet.

AWS claims that, when completed, it will be the world’s largest AI compute cluster, built specifically for Anthropic to train and deploy future models. Elon Musk’s xAI is attempting a similar feat with its Colossus cluster, with plans underway to double its 100,000 Nvidia GPUs.

Alongside Trainium2, AWS used re:Invent to provide a glimpse at its next-generation custom silicon, Trainium3.

The hyperscaler provided few details but revealed that Trainium3 will be its first custom chip built on a three-nanometre process node.

AWS claimed that the eventual Trainium3-powered UltraServers will deliver four times the performance of Trn2 UltraServers, with the first instances expected to be available in late 2025.

The company said the future Trainium3 will “allow customers to iterate even faster when building models and deliver superior real-time performance when deploying them”.

RELATED STORIES

AWS unveils next-gen data centre tech to meet AI workload demands

Amazon launches $110m project to provide custom chips for AI research