2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC)
DOI: 10.1109/pccc.2017.8280472
Improving 3D lattice Boltzmann method stencil with asynchronous transfers on many-core processors

Abstract: CPU-based many-core processors present an alternative to multicore CPU and GPU processors. In particular, the 93-Petaflops Sunway supercomputer, built from clustered many-core processors, has opened a new era for high performance computing that does not rely on GPU acceleration. However, memory bandwidth remains the main challenge for these architectures. This motivates our endeavor for optimizing one of the most data-intensive kinds of stencil computations, namely the three-dimensional applications of the latt…

Cited by 9 publications (11 citation statements). References 13 publications.
“…However, most GPUs have only poor FP64 (double precision) arithmetic capabilities and thus the vast majority of GPU codes have been implemented in FP32 (single precision), while most CPU codes are written in FP64. This difference, and, in particular, whether FP32 is sufficient for LBM simulations compared to FP64, has been a point of persistent discussion within the LBM community [15][16][17][18][19][20][31][32][33][34][35][36][52][53][54][55][56][57][58]60,[63][64][65]. Nevertheless, only a few papers [19,35,36,52,60,66] provide some comparison on how floating-point formats affect the accuracy of the LBM and mostly find only insignificant differences between FP64 and FP32 except at very low velocity and where floating-point round-off leads to spontaneous symmetry breaking.…”
Section: Introduction
confidence: 99%
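The low-velocity round-off issue the quote refers to can be reproduced in a few lines of standard Python (a minimal sketch of the phenomenon, not code from the cited papers): a perturbation below FP32 machine epsilon vanishes when added to a unit background density, but survives if only the deviation from the background is stored.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

# A tiny flow perturbation on top of a unit background density,
# as occurs in low-velocity LBM simulations.
rho_background = 1.0
delta = 1e-9  # perturbation far below FP32 machine epsilon (~1.2e-7)

fp64_sum = rho_background + delta          # FP64 resolves the perturbation
fp32_sum = to_fp32(rho_background + delta) # FP32 rounds it away entirely

print(fp64_sum > rho_background)   # True
print(fp32_sum == rho_background)  # True

# Storing the *deviation* from the background instead of the absolute
# value keeps full relative precision even in FP32:
print(to_fp32(delta) != 0.0)       # True
```

Storing deviations from a constant background is a common remedy in LBM codes; the `rhom1s`/`rhom1e` ("rho minus one") identifiers in the FluidX3D listing quoted further below suggest the same idea.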
“…This is equivalent to decoupling arithmetic precision and memory precision [84,85]. As a desirable side effect, since the limiting factor regarding compute time is memory bandwidth [12][13][14][15][16][17][18][19][20][21][30][31][32][33][34][35][36][37][38][39][40][41][42][43][44][45][52][53][54][55]59,60,63,64,67,[86][87][88], lower precision DDFs also vastly increase performance. Such a mixed precision variant, where arithmetic is done in FP64 and DDF storage in FP32, has already been used by Refs.…”
Section: Introduction
confidence: 99%
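The decoupling of arithmetic precision and memory precision described above can be sketched as follows (a hypothetical Python illustration; the relaxation time and DDF values are placeholders, not taken from the cited work): the DDFs live in an FP32 array, halving memory traffic, while each collision update is computed in FP64.

```python
from array import array

q = 19                      # D3Q19: 19 DDFs per lattice cell
f = array('f', [0.05] * q)  # FP32 storage of one cell's DDFs (4 B each)
tau = 0.6                   # BGK relaxation time (hypothetical value)

def collide(f_cell, feq_cell, tau):
    # Load FP32 values, relax toward equilibrium in FP64 (Python floats),
    # then let the caller store the result back as FP32.
    return [fi + (feqi - fi) / tau for fi, feqi in zip(f_cell, feq_cell)]

feq = [0.055] * q           # placeholder equilibrium values
f_new = array('f', collide(list(f), feq, tau))  # cast back to FP32 on store

print(len(f_new))  # 19
```

The cast on store is the only place precision is lost; all intermediate arithmetic keeps FP64 accuracy, which is the mixed-precision variant the quote describes.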
“…Thanks to the simple nature of individual stencil codes and their ease of parallelization, the performance of stencil code accelerators is typically not bound by computational capacity, but by the speed at which grid data can be accessed [14,27,43]. As a result, a large amount of work has focused on memory access and re-use methodologies, aiming to improve the ratio between the amount of memory access and computation.…”
Section: Stencil Computing and Its Acceleration
confidence: 99%
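The memory-bound claim can be checked with a back-of-the-envelope roofline estimate (all figures below are illustrative assumptions, not numbers from the citing papers): a D3Q19 step streams 19 DDFs in and 19 out per cell, so in FP32 the arithmetic intensity sits far below the machine balance of a typical GPU.

```python
# Per-cell memory traffic for one D3Q19 time step in FP32:
bytes_per_cell = 19 * 2 * 4   # 19 loads + 19 stores, 4 B each = 152 B

# Rough FLOP count for one collision step (assumed figure):
flops_per_cell = 300

arithmetic_intensity = flops_per_cell / bytes_per_cell  # ~2 FLOP/byte

# Hypothetical data-center GPU: ~20 TFLOPs/s FP32 vs ~1 TB/s bandwidth,
# i.e. ~20 FLOP/byte are needed before compute becomes the bottleneck.
machine_balance = 20e12 / 1e12

print(arithmetic_intensity < machine_balance)  # True: memory-bound
```

Because the intensity falls an order of magnitude short of the balance point, reducing bytes moved (lower-precision storage, better re-use) pays off directly, which is exactly the optimization direction the quoted passage describes.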
“…Nvidia GPUs, CPUs and Nvidia GPUs [16][17][18][19][20][21][22][23][24][25][26] or mobile SoCs [116,117]; only a few use OpenCL [5][6][7][8][9][10][11][12][13][14][15]. With FluidX3D also being implemented in OpenCL, we are able to benchmark our code across a large variety of hardware, from the world's fastest data-center GPUs over gaming GPUs and CPUs to even the GPUs of mobile-phone ARM SoCs.…”
Section: Memory and Performance Comparison
confidence: 99%
“…8.9 LBM core of the FluidX3D OpenCL C implementation (D3Q19 SRT FP32/xx):

feq[ 1] = fma(rhos, fma(0.5f, fma(ux, ux, c3),  ux), rhom1s);
feq[ 2] = fma(rhos, fma(0.5f, fma(ux, ux, c3), -ux), rhom1s); // +00 -00
feq[ 3] = fma(rhos, fma(0.5f, fma(uy, uy, c3),  uy), rhom1s);
feq[ 4] = fma(rhos, fma(0.5f, fma(uy, uy, c3), -uy), rhom1s); // 0+0 0-0
feq[ 5] = fma(rhos, fma(0.5f, fma(uz, uz, c3),  uz), rhom1s);
feq[ 6] = fma(rhos, fma(0.5f, fma(uz, uz, c3), -uz), rhom1s); // 00+ 00-
feq[ 7] = fma(rhoe, fma(0.5f, fma(u0, u0, c3),  u0), rhom1e);
feq[ 8] = fma(rhoe, fma(0.5f, fma(u0, u0, c3), -u0), rhom1e); // ++0 --0
feq[ 9] = fma(rhoe, fma(0.5f, fma(u1, u1, c3),  u1), rhom1e);
feq[10] = fma(rhoe, fma(0.5f, fma(u1, u1, c3), -u1), rhom1e); // +0+ -0-
feq[11] = fma(rhoe, fma(0.5f, fma(u2, u2, c3),  u2), rhom1e);
feq[12] = fma(rhoe, fma(0.5f, fma(u2, u2, c3), -u2), rhom1e); // 0++ 0--
feq[13] = fma(rhoe, fma(0.5f, fma(u3, u3, c3),  u3), rhom1e);
feq[14] = fma(rhoe, fma(0.5f, fma(u3, u3, c3), -u3), rhom1e); // +-0 -+0
feq[15] = fma(rhoe, fma(0.5f, fma(u4, u4, c3),  u4), rhom1e);
feq[16] = fma(rhoe, fma(0.5f, fma(u4, u4, c3), -u4), rhom1e); // +0- -0+
feq[17] = fma(rhoe, fma(0.5f, fma(u5, u5, c3),  u5), rhom1e);
feq[18] = fma(rhoe, fma(0.5f, fma(u5, u5, c3), -u5), rhom1e); // 0+- 0-+
…”
Section: LBM Equations in a Nutshell
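The listing factors each equilibrium term into nested fused multiply-adds. A quick sanity check (with `fma` modeled algebraically as a*b + c; the identifiers `rhos`, `c3` and the `rhom1` offset are taken from the listing, but their exact physical meaning and the sample values below are assumptions) confirms the nesting encodes the polynomial rhos·(0.5·u² + 0.5·c3 + u) + rhom1:

```python
import math

def fma(a, b, c):
    # Algebraic model of the hardware fused multiply-add; a real FMA
    # additionally rounds only once, which slightly improves accuracy.
    return a * b + c

def feq_nested(rhos, u, c3, rhom1):
    # Nested-fma form as in the FluidX3D listing.
    return fma(rhos, fma(0.5, fma(u, u, c3), u), rhom1)

def feq_expanded(rhos, u, c3, rhom1):
    # The polynomial the nesting encodes.
    return rhos * (0.5 * u * u + 0.5 * c3 + u) + rhom1

rhos, u, c3, rhom1 = 0.12, 0.03, -0.002, 0.11
print(math.isclose(feq_nested(rhos, u, c3, rhom1),
                   feq_expanded(rhos, u, c3, rhom1), rel_tol=1e-12))  # True
```

Flipping the sign of u gives the opposite lattice direction, which is why the listing computes each +/- direction pair from the same sub-expressions, trading redundant multiplies for a handful of FMA instructions.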