Authors:
(1) Simone Silvestri, Massachusetts Institute of Technology, Cambridge, MA, USA;
(2) Gregory Wagner, Massachusetts Institute of Technology, Cambridge, MA, USA;
(3) Christopher Hill, Massachusetts Institute of Technology, Cambridge, MA, USA;
(4) Matin Raayai Ardakani, Northeastern University, Boston, MA, USA;
(5) Johannes Blaschke, Lawrence Berkeley National Laboratory, Berkeley, CA, USA;
(6) Valentin Churavy, Massachusetts Institute of Technology, Cambridge, MA, USA;
(7) Jean-Michel Campin, Massachusetts Institute of Technology, Cambridge, MA, USA;
(8) Navid Constantinou, Australian National University, Canberra, ACT, Australia;
(9) Alan Edelman, Massachusetts Institute of Technology, Cambridge, MA, USA;
(10) John Marshall, Massachusetts Institute of Technology, Cambridge, MA, USA;
(11) Ali Ramadhan, Massachusetts Institute of Technology, Cambridge, MA, USA;
(12) Andre Souza, Massachusetts Institute of Technology, Cambridge, MA, USA;
(13) Raffaele Ferrari, Massachusetts Institute of Technology, Cambridge, MA, USA.
Table of Links
Abstract and 1 Justification
2 Performance Attributes
3 Overview of the Problem
4 Current State of the Art
5 Innovations
5.1 Starting from scratch with Julia
5.2 New numerical methods for finite volume fluid dynamics on the sphere
5.3 Optimization of ocean free surface dynamics for unprecedented GPU scalability
6 How performance was measured
7 Performance Results and 7.1 Scaling Results
7.2 Energy efficiency
8 Implications
9 Acknowledgments and References
6 How performance was measured
Oceananigans performance is measured for two near-global ocean simulations with different domains: a realistic (R) domain and an aqua-planet (AP) domain. Both domains span the entire longitudinal extent of the sphere and cover latitudes from 75°S to 75°N.
The realistic (R) domain has realistic bathymetry and is forced by realistic surface momentum, heat, and salinity fluxes derived from the ECCO2 state estimate [31], at three resolutions (a back-of-envelope estimate of the implied grid sizes follows the list):
• OceananigansR12 with 1/12th degree horizontal resolution (∼7 km) and 48 vertical levels
• OceananigansR24 with 1/24th degree horizontal resolution (∼3.4 km) and 100 vertical levels
• OceananigansR48 with 1/48th degree horizontal resolution (∼1.7 km) and 100 vertical levels
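These resolutions imply the following approximate horizontal grid sizes over 360° of longitude and 150° of latitude (75°S to 75°N). The Julia sketch below is an illustration only, assuming uniform spacing and ignoring land masking; it is not taken from the paper.

```julia
# Back-of-envelope grid sizes implied by the resolutions listed above:
# 360° of longitude × 150° of latitude, uniform spacing, land cells not excluded.
for (name, cells_per_degree, Nz) in (("R12", 12, 48), ("R24", 24, 100), ("R48", 48, 100))
    Nx = 360 * cells_per_degree
    Ny = 150 * cells_per_degree
    cells = Nx * Ny * Nz
    println("Oceananigans$name: $Nx × $Ny × $Nz ≈ $(round(cells / 1e9; digits = 2)) billion cells")
end
```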
Figure 3 shows surface vertical vorticity after one year of integration for OceananigansR12 and OceananigansR48, both over the global ocean and in selected regions that show further detail. Both OceananigansR12 and OceananigansR48 exhibit macroscale turbulent ocean features that are currently unresolved by most IPCC-class models. The OceananigansR48 solution additionally exhibits fronts, filaments, and other “submesoscale” vorticity features that have been realized only a handful of times in global simulations.
The OceananigansAP suite of simulations [13], which has idealized bathymetry and surface forcing that require no interpolation between resolutions, is used for the weak scaling experiments. All OceananigansAP experiments have 100 vertical levels and two latitudinal ridges that divide the world ocean into two basins. We vary the horizontal resolution of OceananigansAP from 1/6th of a degree (∼14 km) to 1/196th of a degree (∼488 m).
None of our simulations requires explicit horizontal diffusion of momentum or tracers, owing to the adaptive WENO advection scheme described in section 5.2. All simulations use a Richardson-number-based parameterization for vertical mixing by unresolved shear and convective turbulence at 1–100 m scales.
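For concreteness, the sketch below shows how such a configuration might be expressed with Oceananigans' Julia API. The grid corresponds roughly to the R12 setup; the bathymetry, surface forcing, and the exact advection and closure parameters of the production runs are not reproduced here, so every keyword value should be read as an assumption.

```julia
using Oceananigans

# Roughly R12-sized latitude-longitude grid: 1/12° horizontal spacing,
# 48 vertical levels, spanning 75°S to 75°N (bathymetry and forcing omitted).
grid = LatitudeLongitudeGrid(GPU();
                             size = (4320, 1800, 48),
                             longitude = (-180, 180),
                             latitude = (-75, 75),
                             z = (-6000, 0))

# Hydrostatic model with WENO advection (no explicit horizontal diffusion)
# and a Richardson-number-based vertical mixing closure.
model = HydrostaticFreeSurfaceModel(; grid,
                                    momentum_advection = WENOVectorInvariant(),
                                    tracer_advection = WENO(),
                                    closure = RiBasedVerticalDiffusivity(),
                                    buoyancy = SeawaterBuoyancy(),
                                    tracers = (:T, :S))
```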
To assess the time-to-solution of each experiment in simulated years per day (SYPD), we measure the average wall clock time per time step. Wall clock time is sampled with NVIDIA's Nsight Systems and recorded through the NVIDIA Tools Extension (NVTX) library via the NVTX.jl Julia package.
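A minimal sketch of this instrumentation, assuming NVTX.jl's `@range` macro with Oceananigans' `time_step!` as the driver (`model` and `Δt` are placeholders from a setup like the one sketched above):

```julia
using Oceananigans, NVTX

# Annotate each model time step so that Nsight Systems reports its wall clock time.
# The 1500 steps mirror the averaging window used for the timings reported below.
for n in 1:1500
    NVTX.@range "time step" begin
        time_step!(model, Δt)
    end
end
```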
To assess the efficiency of each solution in simulated years per megawatt-hour (SYPMWh), we combine SYPD with an estimate of the mean power draw over the duration of an experiment. On MIT Satori [2], which has 256 Nvidia V100s, we have access to precise, billing-grade power metering. For all simulations with Nvidia A100s we estimate the power consumption P with equation (1), which expresses P as a function of D, the number of A100s, and N, the number of nodes.
We further note that power estimates are provided for LICOM3 and COSMO, but not for LLC4320 or Veros. To estimate the power consumption of LLC4320, we assume that each of its 1000 dual-CPU nodes draws 500 W. We estimate the power consumption of iHESP CESM [51] and HadGCM3 [37] as a percentage of the peak power consumption of their respective clusters. We use equation (1) to estimate Veros' power consumption on one node with 16 A100s.
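To illustrate how such a power model translates into SYPMWh, the sketch below combines a per-GPU draw with a per-node overhead; the wattages are placeholders chosen for illustration, not the coefficients of equation (1).

```julia
# Hypothetical power model in the spirit of equation (1): D A100 GPUs plus a
# per-node overhead for N nodes. The wattages are illustrative placeholders,
# not the paper's coefficients.
estimated_power_watts(D, N; gpu_watts = 400.0, node_watts = 1000.0) =
    D * gpu_watts + N * node_watts

# Energy efficiency: simulated years per megawatt-hour, from SYPD and mean power.
sypmwh(sypd, power_watts) = sypd / (power_watts * 24 / 1e6)   # 24 h/day, 1e6 W per MW

# Example: a hypothetical run at 1 SYPD on 4 nodes hosting 16 A100s in total.
P = estimated_power_watts(16, 4)   # 10_400 W with the placeholder coefficients
sypmwh(1.0, P)                     # ≈ 4.0 SYPMWh
```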
7 Performance Results
We report both scaling results via time-to-solution in SYPD and efficiency results via energy-to-solution in SYPMWh.
7.1 Scaling Results
Realistic ocean simulations (Satori and Engaging clusters). We report strong scaling tests using the realistic global setup shown in figure 3 on two clusters: (i) MIT Satori [2], a high-performance Power9 system composed of 64 nodes, each hosting four Nvidia V100 GPUs with 32 GB of memory, and (ii) the MIT Engaging cluster, using 8 nodes that each host four NVLink-connected A100s with 80 GB of memory. The resulting wall clock time per time step, averaged over 1500 time steps, is presented in figure 4 for both single-precision (FP32) and double-precision (FP64) computations. On a single node, OceananigansR12 attains 0.9 SYPD in double precision and 1.4 SYPD in single precision, with wall clock times per time step ranging from 330 to 550 ms. As the number of nodes increases to 16 (64 GPUs), communication overhead grows, resulting in 12.4 SYPD in single precision and 7.75 SYPD in double precision. We measure a strong scaling efficiency of 52% in single precision and 55% in double precision over 64 GPUs, because the computational workload (about 40 ms of wall clock time per time step) eventually becomes too short to completely mask the communication overhead.
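For reference, the strong scaling efficiency quoted here is the measured speedup divided by the ideal speedup; with the single-precision figures above (∼330 ms per step on one node, ∼40 ms on 16 nodes), the arithmetic works out as in the sketch below.

```julia
# Strong scaling efficiency: reference time per step divided by
# (node ratio × time per step at the larger node count).
strong_scaling_efficiency(t_ref, t_n, node_ratio) = t_ref / (node_ratio * t_n)

# Single-precision OceananigansR12: ~330 ms per step on 1 node, ~40 ms on 16 nodes.
strong_scaling_efficiency(0.330, 0.040, 16)   # ≈ 0.52, the 52% quoted above
```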
For the higher-resolution, ocean-weather-permitting simulations, scaling is nearly ideal across the range we investigate. For OceananigansR24 (FP64, V100) and OceananigansR48 (FP32, V100), we measure better-than-ideal (superlinear) scaling. This counter-intuitive result is a product of improved load balance as the number of GPUs increases. In summary, we attain 1.94 SYPD on 120 V100 GPUs at kilometer-scale resolution (OceananigansR24) and 0.33 SYPD with an ocean-weather-resolving simulation (OceananigansR48). Finally, we have tested the OceananigansR48 setup on 144 Perlmutter nodes (576 A100 GPUs), reaching 0.95 SYPD. This is the first instance of a kilometer-scale ocean simulation achieving ∼1 SYPD. We have also tested the OceananigansR12 setup on 17 nodes, obtaining 9.9 SYPD (see figure 5).
Aqua-planet simulations (Perlmutter cluster). We report weak scaling tests on the NERSC supercomputer Perlmutter, an HPE (Hewlett Packard Enterprise) Cray EX supercomputer whose nodes each host four A100 GPUs with 40 GB of memory, linked through an NVLink3 interconnect. All weak scaling tests are performed using the OceananigansAP setup in double precision. We start from two horizontal resolutions (1/6 and 1/12 of a degree) and progressively refine them as the number of GPUs grows, keeping the workload per GPU fixed while maintaining 100 vertical levels. As shown in figure 5, we obtain 100% weak scaling efficiency over the whole investigated range (1 to 196 nodes; 4 to 768 A100s).
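Weak scaling efficiency is defined analogously to the strong scaling case, except that the per-GPU workload is held fixed, so ideal scaling corresponds to a constant wall clock time per time step; a minimal sketch with placeholder timings:

```julia
# Weak scaling efficiency: per-step wall clock time at the reference node count
# divided by per-step time at the larger node count, with per-GPU workload fixed.
weak_scaling_efficiency(t_ref, t_n) = t_ref / t_n

# If the time per step does not grow as nodes are added, efficiency stays at 100%.
weak_scaling_efficiency(0.120, 0.120)   # = 1.0 (placeholder timings)
```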