2011
DOI: 10.1016/j.parco.2011.03.005
A flexible Patch-based lattice Boltzmann parallelization approach for heterogeneous GPU–CPU clusters

Abstract: Sustaining a large fraction of single GPU performance in parallel computations is considered to be the major problem of GPU-based clusters. In this article, this topic is addressed in the context of a lattice Boltzmann flow solver that is integrated in the WaLBerla software framework. We propose a multi-GPU implementation using a block-structured MPI parallelization, suitable for load balancing and heterogeneous computations on CPUs and GPUs. The overhead required for multi-GPU simulations is discussed in detail…

Cited by 49 publications (28 citation statements) | References 18 publications
“…Simple in its design, the scenario is yet representative for a large class of problems regarding performance considerations for LBM simulations on regular Cartesian grids. It is further used as a benchmark in a variety of LBM performance evaluations and comparisons [7,9,10,14]. A relaxation time τ = 0.6152 ∈ (0.5, 2) is applied and all test runs take 1024 timesteps, thus, 512 α- and β-steps are executed.…”
Section: Results (mentioning)
confidence: 99%
“…We are not the first group setting up such a solver. However, there are a few major differences to other approaches: [55] and [56] both implement the lattice-Boltzmann method on top of a highly optimised octree grid implementation. I.e., the solver itself had to be developed more or less from scratch in a way that is adapted to the grid structure and its specific memory-optimal traversal in the Peano framework [57] or waLBerla [58].…”
Section: Discussion (mentioning)
confidence: 99%
“…I.e., the solver itself had to be developed more or less from scratch in a way that is adapted to the grid structure and its specific memory-optimal traversal in the Peano framework [57] or waLBerla [58]. Also, [56] provides a solver on statically adaptive meshes, which allows for much more sophisticated solutions in terms of run-time optimisation per core and on massively parallel hardware. In contrast, we focus on full dynamical adaptivity as required by the physical systems simulated with ESPResSo, where regions with high refinement requirements move with particles or molecules immersed in the flow.…”
Section: Discussion (mentioning)
confidence: 99%
“…One of them is data structure design, which can affect the speed of memory access. The Array-of-Structures (AoS) layout [19] is chosen on the multi-core CPU, meaning that the particle distribution functions (PDFs) of each cell are stored adjacent in memory. In contrast, in the Structure-of-Arrays (SoA) layout used on the GPU, the PDFs pointing in the same lattice direction are adjacent in global memory.…”
Section: Data Structure Design and Memory Access Model (mentioning)
confidence: 99%
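The AoS/SoA distinction described in the excerpt above can be sketched as follows. This is a minimal illustration, not code from the cited paper: the grid dimensions, the D3Q19 stencil size, and the array names are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical dimensions: a small D3Q19 lattice (19 PDFs per cell).
nx, ny, nz, q = 8, 8, 8, 19

# Array-of-Structures (AoS): the 19 PDFs of one cell sit contiguously
# in memory (PDF index is the fastest-varying axis), which suits CPU
# caches when a whole cell is collided at once.
aos = np.zeros((nx, ny, nz, q))

# Structure-of-Arrays (SoA): all PDFs of one lattice direction sit
# contiguously, so neighbouring GPU threads working on neighbouring
# cells issue coalesced reads from global memory.
soa = np.zeros((q, nx, ny, nz))

# The same PDF (cell (1, 2, 3), lattice direction 5) in both layouts:
aos[1, 2, 3, 5] = 0.25
soa[5, 1, 2, 3] = 0.25

# Converting between the layouts is a transpose of the PDF axis.
assert np.allclose(np.moveaxis(aos, -1, 0), soa)
```

The choice matters because the two architectures reward opposite access patterns: the CPU benefits from spatial locality within a cell, while the GPU benefits from locality across cells for a fixed direction.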
“…In the case of LBM implementation for multi-GPU and heterogeneous CPU-GPU clusters, Wang et al. [18] performed simulations of lid-driven cavity flow on an HPC system named Tsubame comprising 170 NVIDIA Tesla S1070 boxes. Christian et al. [19] developed an approach for heterogeneous simulations on clusters with varying node configurations, using the WaLBerla framework. In terms of ELBM parallelization and optimization strategies, there are a few research works [20,21] that aim to reduce the computational overhead; most of them improve on the numerical calculation aspect.…”
Section: Introduction (mentioning)
confidence: 99%