Proceedings of the Platform for Advanced Scientific Computing Conference 2022
DOI: 10.1145/3539781.3539785
Reducing communication in the conjugate gradient method

Cited by 6 publications (5 citation statements). References 27 publications.
“…We list the peak FOM of the degree N = 15 tests in Table 2, where we see that, when weak-scaled, we observe 943.6 GFLOPS or higher on each NVIDIA Tesla V100, 1062.8 GFLOPS or higher on each AMD Instinct MI100, and 1287.1 GFLOPS or higher on each GCD of an AMD Instinct MI250X. Compared with other GPU performance values for NekBone in the literature, Karp et al (2020) used a version of NekBone with a native CUDA Poisson operator kernel to report 410 GFLOPS on a single NVIDIA Tesla V100 at degree N = 9. Figure 4(a) shows our hipBone benchmark exceeding this FLOP rate at the lower polynomial degree N = 7, achieving 657.6 GFLOPS on a single NVIDIA Tesla V100 despite the lower arithmetic intensity.…”
Section: Computational Tests
confidence: 72%
“…Gong et al (2016) demonstrated a GPU-accelerated version of NekBone using OpenACC and CUDA Fortran. This version was later improved by Karp et al (2020) using native CUDA C kernels with implementations based on the algorithms from Świrydowicz et al (2019). Porting NekBone to FPGAs was also studied by Brown (2020).…”
Section: Introduction
confidence: 99%
“…All of our solvers are designed with locality across the memory hierarchy in mind, so that modern GPUs with a significant machine imbalance can be used as efficiently as possible. Parts of this optimization process and the theoretical background are described in Karp et al (2022a).…”
Section: Numerical Solvers
confidence: 99%
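The locality optimization referred to above can be illustrated with a small sketch (a generic illustration of kernel fusion in CG, not Neko's actual implementation; all names are hypothetical): several CG vector updates and the residual dot product can be fused into a single pass over memory, so the residual vector is streamed once instead of multiple times.

```python
import numpy as np

def cg_update_unfused(x, r, p, Ap, alpha):
    # Three separate passes over memory: the x-update, the r-update,
    # and then a third pass for the residual dot product.
    x = x + alpha * p
    r = r - alpha * Ap
    rr = float(np.dot(r, r))
    return x, r, rr

def cg_update_fused(x, r, p, Ap, alpha):
    # One pass over memory: x and r are updated and the new residual
    # norm is accumulated in the same loop, reducing data movement.
    rr = 0.0
    for i in range(x.size):
        x[i] += alpha * p[i]
        r[i] -= alpha * Ap[i]
        rr += r[i] * r[i]
    return x, r, rr
```

On a real GPU the fused variant would be a single kernel; the point of the sketch is only that both variants compute identical results while the fused one touches each vector fewer times.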
“…We do this by merging kernels and by utilizing shared memory and registers in compute-heavy kernels, as detailed in, for example, Wahib and Maruyama (2014). For modern GPUs, the spectral element method is in the memory-bound domain, as discussed by Kolev et al (2021), so optimizing the code for temporal and spatial locality is our main priority when designing kernels for the GPU backend in Neko; this was recently considered in depth for the CG method used in Neko by Karp et al (2022a).…”
Section: GPU Implementation Considerations
confidence: 99%
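Why such kernels sit in the memory-bound domain can be sketched with a simple roofline estimate (the numbers below are nominal, V100-like figures chosen for illustration, not measurements from NekBone or Neko):

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops, bytes_moved):
    # Simple roofline model: performance is capped either by the
    # compute peak or by bandwidth times arithmetic intensity (flop/byte).
    intensity = flops / bytes_moved
    return min(peak_gflops, bandwidth_gbs * intensity)

# A double-precision vector update x[i] += a * p[i] does 2 flops per
# element while moving 24 bytes (read x, read p, write x), so with
# ~900 GB/s of bandwidth it can reach only a small fraction of a
# multi-TFLOPS compute peak.
vec_update_gflops = attainable_gflops(7000.0, 900.0, flops=2, bytes_moved=24)
```

With these assumed figures the vector update tops out at 75 GFLOPS, far below the compute peak, which is why locality optimizations that cut bytes moved matter more than raw flop counts.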
“…From a computational standpoint, some of the advantages of SEM are that it can be implemented in a matrix-free fashion, avoiding the explicit construction of any operator matrix, and that its weak element coupling allows operations to be performed mostly on a local basis, reducing communication requirements. These characteristics, among others, allow the method to handle large problems and perform efficiently on a large number of processing elements [21]. NEKO [19] is a portable framework that implements SEM in object-oriented modern Fortran, allowing better control over memory allocation and modularity and thus providing support for multiple compute architectures.…”
Section: Turbulence With Image Generation
confidence: 99%
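The matrix-free operator application mentioned in the excerpt can be sketched in a minimal 1-D finite/spectral element setting (a generic illustration, not NEKO's Fortran implementation; the element connectivity and local matrix below are made up for the example): each element applies a small local matrix to its own degrees of freedom, and the results are scatter-added into the global vector, so no global matrix is ever assembled.

```python
import numpy as np

def matrix_free_apply(u, A_loc, elem_dofs):
    # Apply the operator element by element: local matrix-vector
    # products followed by a scatter-add (the "gather-scatter" step
    # that sums contributions on shared element-boundary dofs).
    v = np.zeros_like(u)
    for dofs in elem_dofs:
        v[dofs] += A_loc @ u[dofs]
    return v

# Hypothetical 1-D mesh: 3 two-node elements sharing boundary dofs,
# each with the same local "stiffness"-like matrix.
A_loc = np.array([[1.0, -1.0],
                  [-1.0, 1.0]])
elem_dofs = [np.array([0, 1]), np.array([1, 2]), np.array([2, 3])]
```

Since only the small local matrix and the element-to-dof map are stored, memory use and communication stay local to each element, which is the property the excerpt attributes to SEM.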