NVIDIA’s Tesla P100 GPU, the world’s first AI supercomputer data centre GPU, is being used by researchers to build early warning systems for extreme weather events. When and why did this happen?
In September 2020, the Swiss National Supercomputing Centre (CSCS) and MeteoSwiss, Switzerland’s national meteorology office, used NVIDIA’s Tesla P100, the world’s first AI supercomputer data centre GPU, to build an artificially intelligent supercomputing data centre.
It accelerated the COSMO atmospheric model, which had previously relied on central processing units (CPUs). In the words of Christoph Schär, a climate science professor at ETH Zurich, the new supercomputer “renders the calculations more efficient, faster and lower cost”.
Using COSMO, researchers have taken a significant leap forward in climate understanding. They generated climate projections for most of Europe, including areas in Scandinavia, the Mediterranean and Africa, at an ultra-fine 2.2 km resolution. That resolution exposed previously invisible patterns, such as the intensity of hourly rainfall rising by about 7% for each degree of temperature increase. Evidence like that, feeding directly into flood warnings, is the kind that persuades governments to take action that changes millions of lives for the better.
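To make that 7% figure concrete, here is a minimal arithmetic sketch of how the per-degree scaling compounds; the baseline intensity and warming increments are hypothetical illustration values, not results from the COSMO runs.

```python
# Illustrative sketch: compounding the reported ~7% increase in hourly
# rainfall intensity per degree Celsius of warming. The 7%/degree figure is
# the finding quoted above; the baseline intensity and warming levels are
# hypothetical example values, not COSMO results.

def scaled_intensity(baseline_mm_per_hr: float, warming_deg_c: float,
                     rate_per_deg: float = 0.07) -> float:
    """Rainfall intensity after compounding rate_per_deg per degree of warming."""
    return baseline_mm_per_hr * (1.0 + rate_per_deg) ** warming_deg_c

baseline = 20.0  # mm/hr, a hypothetical present-day hourly downpour
for delta_t in (1.0, 2.0, 3.0):
    print(f"+{delta_t:.0f} C: {scaled_intensity(baseline, delta_t):.1f} mm/hr")
# +1 C: 21.4 mm/hr, +2 C: 22.9 mm/hr, +3 C: 24.5 mm/hr
```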
The meteorological forecast is a calculation fed by billions of sensors all around the Earth. They generate petabytes of information that only a supercomputer could process in time.
Meteorological forecast quality depends on model complexity and high resolution, which in turn depend on the performance of supercomputers. The supercomputer relies on interconnect technology to move data across its complex resources.
Data centres used in weather research run high performance computing (HPC) clusters of hundreds or thousands of CPUs alongside voluminous, high-speed storage systems. To complete the performance equation, however, these data centres must move data between servers and storage quickly enough to prevent processing queues. Given the sub-millisecond limits on each transfer, the data paths need enormous capacity. Mellanox is constantly pushing the speed limits of data transport across its InfiniBand range, which is why its adapters, switches, cables and software are used to create the conditions for the lowest possible latency and, in turn, the highest productivity.
To use a climate analogy, data is the processing unit’s oxygen, and a multi-core, multi-processor server can “inhale and exhale” 200 gigabytes of data every second. That input/output (I/O) performance has to be matched reliably, without loss, at sub-600 ns latency if data-movement bottlenecks are to be avoided. Supercomputers enable the weather centres, and Mellanox’s infrastructure is the fertile soil in which the supercomputer can thrive.
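As a rough illustration of why those two numbers belong together, here is a back-of-the-envelope sketch; the 200 GB/s rate and 600 ns budget come from the analogy above, while the 4 KB transfer size is simply an assumed example unit.

```python
# Back-of-the-envelope sketch using the round numbers from the paragraph
# above: a 200 GB/s server I/O rate and a 600 ns latency budget. The 4 KB
# transfer size is just an assumed example unit.

io_rate_bytes_per_s = 200e9   # 200 gigabytes per second
latency_s = 600e-9            # 600 nanoseconds

# Bandwidth-delay product: data that must be in flight to keep the link busy.
in_flight_bytes = io_rate_bytes_per_s * latency_s
print(f"Data in flight at full rate: {in_flight_bytes / 1e3:.0f} KB")   # 120 KB

# Serialising a 4 KB page at 200 GB/s takes only ~20 ns, so most of the
# 600 ns budget is consumed by network latency, not by the transfer itself.
page_bytes = 4096
print(f"Time to push 4 KB onto the wire: {page_bytes / io_rate_bytes_per_s * 1e9:.0f} ns")
```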
NVIDIA Networking has also provided this technology to scientists working on climate change forecasts. What does that work involve?
Indeed, we are working with many of the world’s leading meteorological services which have chosen NVIDIA Networking’s HDR InfiniBand to accelerate their supercomputing platforms, including the Spanish Meteorological Agency, the China Meteorological Administration, the Finnish Meteorological Institute, NASA and the Royal Netherlands Meteorological Institute, the Beijing Meteorological Service, and Meteo France, the French national meteorological service.
The design of InfiniBand rests on four fundamentals: a smart endpoint design that can run all network engines; a software-defined switch network designed for scale; centralized management that lets the network be controlled and operated from a single place; and standard technology, ensuring forward and backward compatibility, with support for open source technology and open APIs.
It’s these fundamentals that help InfiniBand provide the highest network performance, extremely low latency and high message rate. As the only 200Gbps high-speed interconnect in the market today, InfiniBand delivers the highest network efficiency with advanced end-to-end adaptive routing, congestion control and quality of service.
HDR InfiniBand will also accelerate the new supercomputer for the European Centre for Medium Range Weather Forecasts (ECMWF). Being deployed this year, the system will support weather forecasting and prediction researchers from over 30 countries across Europe. It will increase the centre’s weather and climate research compute power by five times, making it one of the world’s most powerful meteorological supercomputers.
The new platform will enable running nearly twice as many higher-resolution probabilistic weather forecasts in less than an hour, improving the ability to monitor and predict increasingly severe weather phenomena and enabling European countries to better protect lives and property.
Weather and climate models are both compute and data intensive. Forecast quality depends on model complexity and high resolution. Resolution depends on the performance of supercomputers. And supercomputer performance depends on interconnect technology to move data quickly, effectively and in a scalable manner across compute resources.
As a networking firm, why did you choose to get involved in these projects?
From an ethical standpoint, it was an obvious course of action that aligns with our core values.
We chose the path of sustainability and, in some way, sustainability chose us, because most of the world’s leading meteorological services wanted NVIDIA Mellanox InfiniBand technology. They needed turbo-charged networking for their supercomputing platforms. InfiniBand is the accelerator of choice for the Spanish Meteorological Agency, the China Meteorological Administration, the Finnish Meteorological Institute, NASA and the Royal Netherlands Meteorological Institute.
The performance of InfiniBand made it the de facto standard for climate research and weather forecasting applications, delivering higher performance, scalability and resilience than all other interconnect technologies.
The Beijing Meteorological Service asked NVIDIA to supply 200 gigabit HDR InfiniBand interconnect technology to accelerate its new supercomputing platform in readiness for the 2022 Winter Olympics in Beijing. Meteo France, the French national meteorological service, has chosen HDR InfiniBand to connect its two new large-scale supercomputers.
This exemplifies how vital the role of networking is in the supercomputing team. Connectivity isn’t just about computing, it’s about continuity.
Turning now to the sustainability of high performance computing, what is NVIDIA doing in this area?
Supercomputing is an energy-intensive activity. According to research highlighted by MIT, training just one large neural network can emit as much carbon as five cars over their lifetimes. It takes immense power to run the processors and almost as much to run the cooling systems.
Traditionally, the only benchmark of success for a supercomputer was its rating in floating point operations per second (FLOPS). That gave a useful insight into raw calculation throughput but no indication of the environmental cost.
NVIDIA’s research suggested that processor speed need not be the only benchmark of productivity, so it designed its GPUs to work smarter, with an emphasis on reliability, usability and availability. These qualities help the supercomputer work more effectively without a negative impact on the environment. NVIDIA’s Selene is second in the Green500, an independently compiled ranking of energy efficiency, and the highest-ranked commercially available supercomputer.
Selene delivers 20.52 gigaFLOPS per watt and is based around a unique type of open infrastructure, the DGX SuperPOD. Designed and built in just a couple of weeks, the DGX SuperPOD combines NVIDIA’s DGX processor design with an AI networking fabric from Mellanox.
It’s this configuration that gives Selene its performance, efficiency and economy, as well as the variety of uses it lends itself to.
What efficiencies have been achieved to date?
Selene delivers 20.52 gigaFLOPS per watt, which makes it the most energy efficient commercially available supercomputing system in the world. It’s also the seventh most powerful, so as a standard bearer it is helping to embolden and advance the cause of supercomputing sustainability.
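To show what a FLOPS-per-watt figure means in practice, here is a minimal arithmetic sketch; the 20.52 gigaFLOPS per watt comes from the answer above, while the sustained petaFLOPS score is an assumed illustration value rather than a figure from this article.

```python
# Sketch of what a gigaFLOPS-per-watt figure implies. The 20.52 GFLOPS/W
# efficiency is quoted above; the sustained petaFLOPS value below is an
# assumed illustration number, not a figure from this article.

efficiency_gflops_per_watt = 20.52
sustained_pflops = 27.6                    # assumed HPL-style benchmark score

sustained_gflops = sustained_pflops * 1e6  # 1 PFLOPS = 1,000,000 GFLOPS
power_watts = sustained_gflops / efficiency_gflops_per_watt
print(f"Implied power draw: {power_watts / 1e6:.2f} MW")   # ~1.35 MW

# The same benchmark score at half the efficiency would need roughly twice
# the power, which is why the Green500 ranks systems by FLOPS per watt
# rather than by raw FLOPS.
```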
The foundation of Selene’s efficiency is its open infrastructure, the DGX SuperPOD, which invites contributions from the world’s experts in this field and keeps it open to efficiency improvements.
The DGX SuperPOD complements NVIDIA’s DGX processor design with an AI networking fabric from NVIDIA Networking (Mellanox). Together they create a versatile platform whose performance, efficiency and economy can be tuned to whatever job it is given.
Since its inception, Selene has run thousands of jobs a week, often simultaneously, spanning AI data analytics, traditional machine learning and HPC applications.
The DGX Station A100 is four times faster than the previous model and delivers three times the performance for large AI training workloads, all from the same carbon footprint.