Esoteric Pull and Esoteric Push: Two Simple In-Place Streaming Schemes for the Lattice Boltzmann Method on GPUs

Lehmann, Moritz F.

doi:10.3390/computation10060092

Cited by 15 publications

(12 citation statements)

References 54 publications

“…This offers a large benefit, most prominently on FP16 accuracy, by substantially reducing numerical loss of significance at no additional computational cost. Since it is also beneficial for regular FP32 accuracy, it is already widely used in LBM codes such as our FluidX3D [6][7][8][9][10][11][12], OpenLB [68][69][70][71], ESPResSo [24][25][26], Palabos [72][73][74][75][76], and some versions of waLBerla [53]. In Appendix A 2, we provide the entire algorithm without and with DDF-shifting for comparison and in Appendix A 3 we clarify our notation.…”

Section: B DDF-shifting (mentioning)

confidence: 99%
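The effect described in this statement can be illustrated with a minimal sketch. It assumes plain IEEE-754 half precision (via Python's `struct` format `e`) rather than the customized 16-bit formats of the citing paper, and the DDF value `f` is a hypothetical number chosen close to the D3Q19 rest-node weight `w = 1/3`. The helper name `to_fp16_and_back` is illustrative only.

```python
import struct

def to_fp16_and_back(x: float) -> float:
    """Round x to IEEE-754 half precision and convert back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

w = 1.0 / 3.0    # D3Q19 rest-node lattice weight
f = w + 1.0e-4   # hypothetical DDF, close to its lattice weight (assumption)

# Unshifted storage: the full DDF is rounded to 16 bits.
plain = to_fp16_and_back(f)

# Shifted storage: only the deviation f - w is rounded to 16 bits,
# and the weight is added back in higher precision after loading.
shifted = to_fp16_and_back(f - w) + w

err_plain = abs(plain - f)
err_shifted = abs(shifted - f)
print(f"unshifted error: {err_plain:.3e}, shifted error: {err_shifted:.3e}")
```

Because the stored deviation is much smaller than the weight itself, the 16-bit grid around it is far finer, so the round-trip error of the shifted value is orders of magnitude smaller in this example.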

“…For D3Q19, going from FP32/FP32 to FP32-16x reduces the memory footprint by ≈45%, to 93 bytes per node. If 16-bit compression was combined with in-place streaming schemes like AA-Pattern [34], Esoteric-Twist [62], Shift-and-Swap-Streaming [59], or the simple Esoteric-Pull [9], the memory footprint can even be reduced by ≈67%, to only 55 bytes per node.…”

Section: Memory and Performance Comparison (mentioning)

confidence: 99%
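The byte counts quoted in this statement can be reproduced with a short calculation. The sketch below assumes that each node stores, besides the DDFs, 17 additional bytes (for example density, velocity, and a flag byte) and that the baseline keeps two copies of the DDFs; both are assumptions chosen to be consistent with the quoted numbers, not details stated in the excerpt. The function name `bytes_per_node` is hypothetical.

```python
def bytes_per_node(q: int, ddf_bytes: int, ddf_copies: int, extra_bytes: int = 17) -> int:
    """Memory per lattice node: q DDFs of ddf_bytes each, stored ddf_copies times,
    plus extra_bytes for other per-node fields (assumed to be 17 bytes)."""
    return q * ddf_bytes * ddf_copies + extra_bytes

baseline = bytes_per_node(19, 4, 2)    # D3Q19, FP32 storage, two DDF copies
compressed = bytes_per_node(19, 2, 2)  # D3Q19, 16-bit storage, two DDF copies
in_place = bytes_per_node(19, 2, 1)    # D3Q19, 16-bit storage, in-place streaming

def reduction_percent(b: int) -> int:
    """Reduction relative to the FP32 baseline, rounded to whole percent."""
    return round(100 * (1 - b / baseline))

print(compressed, reduction_percent(compressed))  # 93 bytes, about 45 % less
print(in_place, reduction_percent(in_place))      # 55 bytes, about 67 % less
```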

“…While most LBM implementations are limited to one particular hardware platform-either CPUs [67][68][69][70][71][72][73][74][75][76][77][78][79], Nvidia GPUs, CPUs and Nvidia GPUs [18][19][20][21][22][23][24][25][26][27][28][29] or mobile SoCs [122,123]-only few use OpenCL [5][6][7][8][9][10][11][12][13][14][15][16][17]. With FluidX3D also being implemented in OpenCL, we are able to benchmark our code across a large variety of hardware, from the world's fastest data-center GPUs over gaming GPUs and CPUs to even the GPUs of mobile phone ARM SoCs.…”

Section: Memory and Performance Comparison (mentioning)

confidence: 99%

“…This approach requires minimal code interventions and can be easily combined with any velocity set, collision operator, swap algorithm, or LBM extension. In addition to using custom-built floating-point formats, we show that shifting the DDFs by subtracting the lattice weights and computing the equilibrium DDFs in a specific order of operations as originally proposed by Skordos [66] and further investigated by He and Luo [94] and Gray and Boek [60]-an optimization beneficial across all floating-point formats and already widely used [6][7][8][9][10][11][12][24][25][26]31,35,53,60,66,[68][69][70][71][72][73][74][75][76]88,94]-turns out absolutely crucial for the 16-bit compression.…”

Section: Introduction (mentioning)

confidence: 99%

“…Regarding LBM accuracy, we study Poiseuille flow through a cylinder [95], Taylor-Green vortex energy dissipation [66,96], Karman vortices [97] from flow around a cylinder, lid-driven cavity [30,33,37,39,49,52,[98][99][100][101][102][103], deformation of a microcapsule in shear flow [104][105][106] with the immersed-boundary method (IBM) extension, and microplastic particle transport during a raindrop impact [10] with the volume-of-fluid and IBM extensions. Regarding performance, we exploit the capability of our FluidX3D LBM implementation written in OpenCL [6][7][8][9][10][11][12] to provide benchmarks for all floating-point variants on a large variety of hardware, from the world's fastest datacenter GPU over various consumer GPUs and CPUs from different vendors to even a mobile phone ARM system-on-a-chip (SoC), and show roofline analysis [64,87,107] for one hardware example.…”

Section: Introduction (mentioning)

confidence: 99%

Accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit, and customized 16-bit number formats

et al. 2022

Self Cite

Fluid dynamics simulations with the lattice Boltzmann method (LBM) are very memory intensive. Alongside reduction in memory footprint, significant performance benefits can be achieved by using FP32 (single) precision compared to FP64 (double) precision, especially on GPUs. Here we evaluate the possibility to use even FP16 and posit16 (half) precision for storing fluid populations, while still carrying arithmetic operations in FP32. For this, we first show that the commonly occurring number range in the LBM is a lot smaller than the FP16 number range. Based on this observation, we develop customized 16-bit formats-based on a modified IEEE-754 and on a modified posit standard-that are specifically tailored to the needs of the LBM. We then carry out an in-depth characterization of LBM accuracy for six different test systems with increasing complexity: Poiseuille flow, Taylor-Green vortices, Karman vortex streets, lid-driven cavity, a microcapsule in shear flow (utilizing the immersed-boundary method), and, finally, the impact of a raindrop (based on a volume-of-fluid approach). We find that the difference in accuracy between FP64 and FP32 is negligible in almost all cases, and that for a large number of cases even 16-bit is sufficient. Finally, we provide a detailed performance analysis of all precision levels on a large number of hardware microarchitectures and show that significant speedup is achieved with mixed FP32/16-bit.

A graphic processing unit implementation for the moment representation of the lattice Boltzmann method

Ferrari, Oliveira, Lugarini, et al. 2023. Numerical Methods in Fluids

We implement the moment representation of the lattice Boltzmann method in a graphic processing unit (GPU) environment to study the computational performance of three-dimensional fluid flows. The moment representation of the lattice Boltzmann method (MRLBM) uses up to second-order moments of the particle distribution function to regularize and reconstruct it. By adopting this method, there is a reduction in memory usage and global memory transfer and, therefore, increased performance. The results show an increase of up to 40% in speed compared to population schemes. The validation results show there was no loss of accuracy when using single precision for turbulent flows. The adoption of single-precision also enables higher processing speed for GPUs with low double-precision capability and reduces the memory footprint, which can further increase performance. The implementation here presented provides a fast and straightforward environment for fluid simulation by combining single-precision and the MRLBM.
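The memory argument in this abstract can be made concrete by counting stored values per node. The sketch below assumes a three-dimensional D3Q19 lattice as the population scheme being compared against (the abstract does not name the velocity set) and counts the independent moments up to second order: one density, the velocity components, and the independent entries of a symmetric second-order tensor. The function name is hypothetical.

```python
def moments_up_to_second_order(dim: int) -> int:
    """Number of independent moments up to second order in `dim` dimensions:
    1 (density) + dim (velocity) + dim*(dim+1)/2 (symmetric second-order tensor)."""
    return 1 + dim + dim * (dim + 1) // 2

q = 19                                  # populations per node for D3Q19 (assumed velocity set)
m = moments_up_to_second_order(3)       # moments per node in 3D

print(f"populations per node: {q}, moments per node: {m}")
print(f"stored values reduced by about {round(100 * (1 - m / q))} %")
```

Under these assumptions a moment representation stores 10 values per node instead of 19, which is the kind of reduction in memory usage and global memory transfer the abstract refers to.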

A robust and efficient solver based on kinetic schemes for Magnetohydrodynamics (MHD) equations

Baty, Drui, Franck, et al. 2023. Applied Mathematics and Computation
