A Study on Parallel Performance of the EULAG F90/95 Code

2016

J Supercomput

The goal of this study is to parallelize the multidimensional positive definite advection transport algorithm (MPDATA) across a computational cluster equipped with GPUs. Our approach allows extensive overlapping of GPU computations with data transfers, both between computational nodes and between the GPU accelerator and the CPU host within a node. To this end, we decompose the computational domain into two unequal parts, one data dependent and one data independent. Data transfers can then be performed simultaneously with the computations for the second part. This approach achieves 16.372 Tflop/s using 136 GPUs. To estimate the scalability of the proposed approach, we develop a performance model dedicated to MPDATA simulations. We focus on the analysis of computation and communication execution times, as well as the influence of overlapping data transfers and GPU computations, with regard to the number of nodes.
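
The overlap idea in this abstract can be illustrated with a toy timing model. This is a minimal sketch and not the performance model developed in the paper: the function name, the parameters, and the assumption that transfers hide completely behind the data-independent part are all hypothetical.

```python
def step_time(t_dependent, t_independent, t_transfer, overlap=True):
    """Estimate the wall time of one time step (any consistent time unit).

    Hypothetical model: the data-dependent part always runs on its own,
    while the data-independent part may run concurrently with transfers.
    """
    if overlap:
        # Transfers are hidden behind the data-independent computations,
        # so only the longer of the two contributes to the step time.
        return t_dependent + max(t_independent, t_transfer)
    # Without overlap, all three phases run one after another.
    return t_dependent + t_independent + t_transfer


# Illustrative numbers only (milliseconds), not measurements from the paper.
overlapped = step_time(2, 10, 6, overlap=True)
serial = step_time(2, 10, 6, overlap=False)
print(overlapped, serial)
```

In this sketch, overlap pays off only while the transfer time stays below the time of the data-independent part; once transfers dominate, the step time grows with them again.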

Section: Related Work

confidence: 99%

“…The EULAG model has the proven record of successful applications, and excellent efficiency and scalability on the conventional supercomputer architectures [18]. Currently, the model is being implemented as the new dynamical core of the COSMO weather prediction framework.…”

Section: Introduction

confidence: 99%

Performance modeling of 3D MPDATA simulations on GPU cluster

2016

J Supercomput

“…In these computations, each point in a 3D data grid is updated based on its neighbors [18] according to a fixed rule.…”
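
The stencil pattern described in this statement can be sketched as follows. The fixed rule used here (the mean of the six face neighbors) is a placeholder chosen for illustration, not the MPDATA update, and `stencil_step` is a hypothetical name.

```python
def stencil_step(grid):
    """Return a new 3D grid after one stencil sweep.

    Each interior point is replaced by the mean of its six face neighbors
    (an illustrative rule, not MPDATA); boundary points are copied unchanged.
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    # Copy the input so every update reads only values from the old grid.
    new = [[[grid[i][j][k] for k in range(nz)] for j in range(ny)]
           for i in range(nx)]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                new[i][j][k] = (
                    grid[i - 1][j][k] + grid[i + 1][j][k]
                    + grid[i][j - 1][k] + grid[i][j + 1][k]
                    + grid[i][j][k - 1] + grid[i][j][k + 1]
                ) / 6.0
    return new


# A 3x3x3 grid filled with 6.0 except for a zero at the center.
grid = [[[6.0] * 3 for _ in range(3)] for _ in range(3)]
grid[1][1][1] = 0.0
result = stencil_step(grid)
```

Because each output point depends only on old values of nearby points, all points can be updated independently, which is what makes such computations a natural fit for GPUs.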

Section: Multidimensional Positive Definite Advection Transport Algor

confidence: 99%

Systematic adaptation of stencil‐based 3D MPDATA to GPU architectures

Concurrency and Computation

Kuczynski

2016

In this work, we focus on a systematic adaptation of the stencil-based multidimensional positive definite advection transport algorithm (MPDATA) to different graphics processing unit (GPU)-based computing platforms. Another objective of this work is to compare the performance of MPDATA on several platforms, including a multi-GPU system with two NVIDIA Tesla K80 cards, and single-card platforms with Tesla K20X, GeForce GTX TITAN, and GeForce GTX 980. We propose the following optimization methods to improve the overall performance: (i) reducing the number of operations through subexpression elimination when implementing 2.5D blocking; (ii) reorganizing boundary conditions to reduce branch instructions; (iii) advanced memory management to increase coalesced memory access; and (iv) warp rearrangement to optimize data access to GPU global memory. The presented methods of adapting MPDATA to GPU architectures allow us to efficiently use many graphics processors within a single node by applying peer-to-peer data transfers between GPU global memories. We propose an auto-tuning procedure to compensate for architectural differences between the considered platforms. This procedure takes into account algorithm/GPU-specific parameters. The proposed approach to adapting MPDATA to GPU architectures allows us to achieve up to 482.5 Gflop/s on the platform equipped with two NVIDIA K80 GPUs.

… for simulating thermo-fluid flows across a wide range of scales and physical scenarios, such as numerical weather and climate prediction, simulation of urban flows, areas of turbulence, ocean currents, and others. Recently, the dynamical core of EULAG has been implemented into the Consortium for Small-scale Modeling (COSMO) weather prediction framework and is expected to be in operational use [5]. The dynamical core of EULAG is based on the non-hydrostatic Euler equations, either fully compressible or anelastic. The model employs the generalized curvilinear coordinate description, the finite-volume non-oscillatory transport algorithm MPDATA, and the advanced elliptic solver generalized conjugate residual (GCR) [6]. To run existing codes efficiently on new hybrid platforms with accelerators, it is necessary to redesign the structures of these codes [7]. In our previous work [8], we proposed two decompositions of 2D MPDATA computations, which provide adaptation to CPU and GPU architectures. We developed a hybrid CPU-GPU version of 2D MPDATA in order to fully utilize all the available computing resources. The next step in our research was to parallelize the 3D version of MPDATA. This required a different approach from the one used for the 2D version. In papers [7,9], we presented an analysis of GPU resource usage and its influence on the resulting performance. We detected the bottlenecks and developed a method for the efficient distribution of computation across GPU kernels. Following our previous papers, in this work, we propose a set of methods for adapting the 3D MPDATA to different GPU accelerators. We investigate differ...

“…In this work, we focus on energy optimization for a predefined execution time, considering the multidimensional positive definite advection transport algorithm (MPDATA). Multidimensional positive definite advection transport algorithm is the main part of the dynamic core of the Eulerian/semi‐Lagrangian (EULAG) fluid solver model, which is an established computational model developed for simulating thermo‐fluid flows across a wide range of scales and physical scenarios. Currently, the model is being implemented as the new dynamic core of the COSMO weather prediction framework.…”

Section: Introduction

confidence: 99%

Energy‐aware mechanism for stencil‐based MPDATA algorithm with constraints

Ilić

Concurrency and Computation

et al. 2016

Summary: In this paper, we propose an energy‐aware task management mechanism designed for forward‐in‐time algorithms running on multicore central processing units (CPUs), of which the stencil‐based multidimensional positive definite advection transport algorithm (MPDATA) is a representative example. The mechanism is based on the dynamic voltage and frequency scaling technique and reduces the energy consumption of an existing algorithm (or application) while respecting a predefined execution time, without requiring any modifications to the algorithm itself. The paper also formulates a method for minimizing energy consumption under time constraints, based on adaptive scheduling with online modeling. Finally, using autotuning, we automate the creation and selection of the best energy profile at runtime, even in the presence of additional CPU workloads. Experimental results on a 6‐core computing platform show that the proposed mechanism provides energy savings of up to 1.43x compared with the default Linux scaling governor. We also confirm the effectiveness of the self‐adaptive feature of the proposed mechanism by showing its ability to maintain the requested execution time despite additional CPU workloads imposed by other applications.
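
The core trade-off in this summary, minimizing energy while respecting a predefined execution time, can be sketched as a selection over candidate CPU frequencies. This is a simplified static illustration, not the paper's mechanism, which relies on adaptive scheduling with online modeling. The function name, the profile format, and all numbers below are hypothetical.

```python
def pick_frequency(profiles, deadline_s):
    """Pick the lowest-energy profile that still meets the deadline.

    profiles: list of (freq_ghz, predicted_time_s, average_power_w) tuples.
    Returns the chosen tuple, or None if no profile meets the deadline.
    """
    feasible = [p for p in profiles if p[1] <= deadline_s]
    if not feasible:
        return None
    # Energy in joules is predicted time multiplied by average power.
    return min(feasible, key=lambda p: p[1] * p[2])


# Hypothetical profiles: lower frequencies run longer but draw less power.
profiles = [
    (1.2, 10.0, 30.0),   # 300 J
    (2.0, 6.0, 60.0),    # 360 J
    (3.0, 4.0, 110.0),   # 440 J
]
choice = pick_frequency(profiles, deadline_s=8.0)
```

With an 8 s deadline the 1.2 GHz profile is too slow, so the sketch selects the 2.0 GHz profile as the cheapest remaining option; a looser deadline would allow the lowest frequency instead.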