For me, it all began last November with a post from an old friend on LinkedIn, expressing how impressed they were with ChatGPT. After eventually signing up myself, what truly captivated me was its ability to provide human-like answers that were both contextually appropriate and technically sound.
Its limitations were also clear of course – almost like interacting with an intelligent but slightly dull human friend. It would respond with bullet-pointed answers, consistently remind me that it was, in fact, an AI model, and urge me to treat its responses with a healthy dose of skepticism. What I found most appealing was the way the answers appeared on the screen, each letter and word emerging slowly, as if typed by a human on the other end of the connection.
Fast forward six months, and now when I type a question for ChatGPT, it responds so rapidly that it leaves me a bit dizzy. What transpired during these past six months? What changes were implemented by the creators of ChatGPT?
Most likely, OpenAI has scaled the inference capacity of its AI cluster to accommodate the demands of over 100 million subscribers. NVIDIA, the leading AI chip maker, is reported to have supplied around 20,000 graphics processing units (GPUs) to support the development of ChatGPT. Moreover, there are plans for significantly increased GPU usage, with speculation that OpenAI's upcoming AI model may require as many as 10 million GPUs.
GPU cluster architecture – the foundation of generative AI
Now, let's take a step back. Wrapping my head around the concept of 20,000 GPUs is manageable, but the thought of optically connecting 10 million GPUs to perform intelligent tasks is quite the challenge.
After a couple of hours of scouring the internet, I stumbled upon various design guides detailing how to build high-performance networks that provide the high-speed connectivity required for AI workloads.
Let’s discuss how we can create GPU clusters by initially configuring smaller setups and then gradually expanding them to incorporate thousands of GPUs. We’ll use NVIDIA design guidelines as the example here, which are rooted in the tradition of high-performance computing (HPC) networks.
According to their recommendations, the process involves constructing substantial GPU clusters from smaller scalable units of 256-GPU pods. Each pod consists of 8 compute racks and 2 middle-of-the-row networking racks. Connectivity within and between these pods is established through InfiniBand, a high-speed, low-latency switching protocol, employing NVIDIA's Quantum-2 switches.
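To put those pod numbers into perspective, here is a minimal back-of-the-envelope sketch. It assumes 8-GPU H100-class nodes, as described further down in this piece; everything else follows from the figures above.

```python
# Rough pod arithmetic: the pod size and rack count come from the paragraph
# above; the 8-GPU node size is an assumption based on H100-class compute nodes.
GPUS_PER_POD = 256
GPUS_PER_NODE = 8          # one H100-class node = 8 GPUs (assumption)
COMPUTE_RACKS_PER_POD = 8

nodes_per_pod = GPUS_PER_POD // GPUS_PER_NODE            # 32 nodes
nodes_per_rack = nodes_per_pod // COMPUTE_RACKS_PER_POD  # 4 nodes per compute rack

print(f"{nodes_per_pod} nodes per pod, {nodes_per_rack} nodes per compute rack")
```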
Current InfiniBand switches utilise 800G OSFP ports, each carrying dual 400G next data rate (NDR) ports. Each 400G port runs over four lanes, i.e. 8 fibres, and a switch offers 64x400G ports in total. It's highly likely that the forthcoming generation of switches, whatever name they carry, will adopt extreme data rate (XDR) speeds. This translates to 64x800G ports per switch, again utilising 8 fibres per port – mostly single-mode fibre. This 4-lane (8-fibre) pattern is a recurring motif in the InfiniBand roadmap, summarised in Table-1, with even faster speeds to come.
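As a quick sanity check on the fibre counts these port maps imply, here is a small sketch. It is simple arithmetic on the figures above; the throughput shown is the one-way sum of the port speeds, nothing more.

```python
# Fibre arithmetic per switch, using the port counts quoted above: every port
# follows the 4-lane pattern, i.e. 8 fibres (one transmit and one receive per lane).
FIBRES_PER_PORT = 8
PORTS_PER_SWITCH = 64

# Gb/s per port for the current (NDR) and anticipated (XDR) generations.
generations = {"NDR": 400, "XDR": 800}

for name, gbps in generations.items():
    fibres = PORTS_PER_SWITCH * FIBRES_PER_PORT        # 512 fibres per switch
    one_way_tbps = PORTS_PER_SWITCH * gbps / 1000      # one-way sum of port speeds
    print(f"{name}: {PORTS_PER_SWITCH}x{gbps}G -> {fibres} fibres, {one_way_tbps:.1f} Tb/s")
```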
When it comes to the cabling approach, the prevailing best practice in the high-performance computing (HPC) world has been to employ point-to-point active optical cables (AOCs) – cables with an optical transceiver permanently terminated on each end and a length of fibre linking the two.
However, with the introduction of the latest 800G NDR ports sporting multifibre push-on (MPO) optical connector interfaces, the landscape has shifted from AOCs to MPO-MPO passive patch cords for point-to-point connections. For a single 256-GPU pod, point-to-point connections pose no significant issues; my personal approach would be to opt for MPO jumpers for a more streamlined setup.
Operating at scale
Things remain relatively smooth up to this point, but challenges emerge when aiming for a larger scale – for example, 16K GPUs, which requires interconnecting 64 of these 256-GPU pods – due to the rail-optimised nature of the compute fabric used for these high-performance GPU clusters. In a rail-optimised setup, each host channel adapter (HCA) in a compute system connects to a different leaf switch, with the same-numbered HCA from every compute system landing on the same leaf switch – one leaf switch per "rail".
This set-up is said to be vital for maximising deep learning (DL) training performance in a multi-job environment. A typical H100 compute node is equipped with four dual-port QSFP adapters, translating to 8 uplink ports – one independent uplink per GPU – that connect to eight distinct leaf switches, thereby establishing an 8-rail-optimised fabric.
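To make the rail idea concrete, here is a minimal sketch of the wiring logic, assuming 32 nodes of 8 GPUs per pod; the device names are purely illustrative and not NVIDIA nomenclature.

```python
# Minimal sketch of the rail-optimised wiring described above: rail k of every
# node in a pod terminates on leaf switch k. Naming is illustrative only.
NODES_PER_POD = 32   # 32 nodes x 8 GPUs = 256 GPUs
RAILS = 8            # one uplink (HCA port) per GPU

def leaf_switch(pod: int, rail: int) -> str:
    """Leaf switch that terminates a given rail for every node in the pod."""
    return f"pod{pod:02d}-leaf{rail}"

def cable_plan(pod: int) -> list[tuple[str, str]]:
    """(node port, leaf switch) pairs for one pod: 256 point-to-point links."""
    return [(f"pod{pod:02d}-node{n:02d}-hca{r}", leaf_switch(pod, r))
            for n in range(NODES_PER_POD) for r in range(RAILS)]

links = cable_plan(0)
assert len(links) == 256                                  # one uplink per GPU
assert links[0] == ("pod00-node00-hca0", "pod00-leaf0")   # rail 0 -> leaf 0
```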
This design works seamlessly when dealing with a single pod featuring 256 GPUs. But what if the goal is to construct a fabric containing 16,384 GPUs? In such a scenario, two additional layers of switching become necessary. The first leaf switch in each pod connects to every switch in spine group one (SG1), the second leaf switch in each pod to every switch in SG2, and so forth. To achieve a fully realised fat-tree topology, a third layer – the core switching group (CG) – must be integrated.
Let's revisit the numbers for a 16,384 GPU cluster once more. Establishing connections between compute nodes and leaf switches (8 per pod) requires 16,384 cables, meaning 256 MPO patch cords per pod. As we embark on the journey of expanding our network, the task of establishing leaf-spine and spine-core connections becomes more challenging. This involves the initial bundling of multiple point-to-point MPO patch cords, which are then pulled across distances ranging from 50 to 500 meters.
Could there be a more efficient approach? One suggestion is to employ a structured cabling system with a two-patch-panel design, utilising high-fibre-count MPO trunks, perhaps 144 fibres. This way, we can consolidate 18 MPO patch cords (1) (18x8=144 fibres) into a single Base-8 trunk cable (3), which can be pulled all at once through the data hall. By utilising patch panels suitable for 8-fibre connectivity (4) and MPO adapter panels (2) at the endpoints, we can then break the trunk out and connect it to our rail-optimised fabric. This method eliminates the need to bundle numerous MPO patch cords.
Figure: A two-patch-panel interconnect design example featuring a 32-fibre trunk (organised as 4x8), along with two patch panels, adapter panels and patch cords.
To illustrate, consider the scenario where 256 uplinks are required from each pod for a non-blocking fabric. We can pull 15x 144-fibre trunks from each pod, yielding 15x18=270 uplinks. Remarkably, this requires just 15 cable jackets. It also leaves 270-256=14 spare connections, which can serve as backups or be utilised for storage or management network connections.
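For anyone planning a similar build, the same arithmetic generalises to other pod sizes. Here is a small helper, written as a sketch under the 144-fibre Base-8 assumption above.

```python
import math

# Trunk planning for pod uplinks: a 144-fibre Base-8 trunk carries 144 / 8 = 18
# uplinks, following the structured-cabling approach described above.
FIBRES_PER_TRUNK = 144
FIBRES_PER_UPLINK = 8
UPLINKS_PER_TRUNK = FIBRES_PER_TRUNK // FIBRES_PER_UPLINK   # 18

def trunks_per_pod(uplinks: int) -> tuple[int, int]:
    """Trunks to pull from one pod and the spare uplinks left over."""
    trunks = math.ceil(uplinks / UPLINKS_PER_TRUNK)
    spares = trunks * UPLINKS_PER_TRUNK - uplinks
    return trunks, spares

print(trunks_per_pod(256))   # -> (15, 14): 15 trunks and 14 spare uplinks per pod
```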
Ultimately, AI has made significant progress in comprehending our questions, and we'll witness its continued evolution. When it comes to enabling this transition, seeking cabling solutions that can support extensive GPU clusters – whether they comprise 16K or 24K GPUs – is an important part of the puzzle and a challenge that the optical connectivity industry is already rising to meet.