Musk to double xAI’s Colossus cluster to 200K GPUs for Grok 3 training

xAI staff including Elon Musk in front of the Colossus supercomputer cluster, containing 100,000 Nvidia H100 GPUs

Elon Musk has announced that xAI, his AI startup rivalling OpenAI, plans to double the size of its Colossus supercomputer cluster, which currently consists of 100,000 Nvidia GPUs.

Located in Memphis, Tennessee, the colossal cluster is among the largest supercomputing facilities in the world and will be expanded as xAI ramps up training for its upcoming Grok 3 model.

The current version of the cluster is made up of 100,000 Nvidia Hopper GPUs and leverages Nvidia's Spectrum-X Ethernet networking platform to enhance data transfer speeds and reduce latency.

However, Musk confirmed in a post on X (formerly Twitter) that the cluster is set to double in size to 200,000 GPUs and will feature Nvidia's upcoming H200 GPU, which launches later this year.

“Colossus is the most powerful training system in the world,” Musk said. “Nice work by xAI team, Nvidia and our many partners/suppliers.”

xAI has been building out its own infrastructure for some time, having previously relied on servers from X and later Oracle Cloud to train its earlier Grok models.

However, Musk is keen to catch up with OpenAI and has been pushing for the startup to have its own dedicated infrastructure to power its AI training and inference workloads.

The supercomputing site in Memphis was designed by Nvidia with the help of Dell and Supermicro. The facility was built in just 122 days, with only 19 days passing between the first rack rolling onto the floor and the start of training.

The mammoth supercomputing cluster will now power training for xAI’s upcoming model, Grok 3, which Musk hinted over the summer could debut by year-end and may rival — or even surpass — the highly anticipated capabilities of OpenAI's GPT-5.

Notably, Musk has set his sights on xAI getting hold of H200s for the Colossus cluster, rather than Nvidia's next-generation Blackwell GPUs, shipments of which have been pushed back due to manufacturing issues. A design fix has since been implemented, with Nvidia CEO Jensen Huang recently placing the blame entirely on Nvidia.

