2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC)
DOI: 10.1109/pccc.2017.8280472
Improving 3D lattice Boltzmann method stencil with asynchronous transfers on many-core processors

Abstract: CPU-based many-core processors present an alternative to multicore CPU and GPU processors. In particular, the 93-Petaflops Sunway supercomputer, built from clustered many-core processors, has opened a new era for high performance computing that does not rely on GPU acceleration. However, memory bandwidth remains the main challenge for these architectures. This motivates our endeavor for optimizing one of the most data-intensive kinds of stencil computations, namely the three-dimensional applications of the latt…

Cited by 9 publications (11 citation statements). References 13 publications.
“…However, most GPUs have only poor FP64 (double precision) arithmetic capabilities and thus the vast majority of GPU codes have been implemented in FP32 (single precision), while most CPU codes are written in FP64. This difference, and, in particular, whether FP32 is sufficient for LBM simulations compared to FP64, has been a point of persistent discussion within the LBM community [15][16][17][18][19][20][31][32][33][34][35][36][52][53][54][55][56][57][58]60,[63][64][65]. Nevertheless, only a few papers [19,35,36,52,60,66] provide some comparison on how floating-point formats affect the accuracy of the LBM and mostly find only insignificant differences between FP64 and FP32 except at very low velocity and where floating-point round-off leads to spontaneous symmetry breaking.…”
Section: Introduction
confidence: 99%
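The low-velocity round-off issue the quote refers to can be reproduced in a few lines of standard Python (a minimal sketch of the phenomenon, not code from the cited papers): a perturbation below FP32 machine epsilon vanishes when added to a unit background density, but survives if only the deviation from the background is stored.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

# A tiny flow perturbation on top of a unit background density,
# as occurs in low-velocity LBM simulations.
rho_background = 1.0
delta = 1e-9  # perturbation far below FP32 machine epsilon (~1.2e-7)

fp64_sum = rho_background + delta          # FP64 resolves the perturbation
fp32_sum = to_fp32(rho_background + delta) # FP32 rounds it away entirely

print(fp64_sum > rho_background)   # True
print(fp32_sum == rho_background)  # True

# Storing the *deviation* from the background instead of the absolute
# value keeps full relative precision even in FP32:
print(to_fp32(delta) != 0.0)       # True
```

Storing deviations from a constant background is a common remedy in LBM codes; the `rhom1s`/`rhom1e` ("rho minus one") identifiers in the FluidX3D listing quoted further below suggest the same idea.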
“…This is equivalent to decoupling arithmetic precision and memory precision [84,85]. As a desirable side effect, since the limiting factor regarding compute time is memory bandwidth [12][13][14][15][16][17][18][19][20][21][30][31][32][33][34][35][36][37][38][39][40][41][42][43][44][45][52][53][54][55]59,60,63,64,67,[86][87][88], lower precision DDFs also vastly increase performance. Such a mixed precision variant, where arithmetic is done in FP64 and DDF storage in FP32, has already been used by Refs.…”
Section: Introduction
confidence: 99%
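The decoupling of arithmetic precision and memory precision described above can be sketched as follows (a hypothetical Python illustration; the relaxation time and DDF values are placeholders, not taken from the cited work): the DDFs live in an FP32 array, halving memory traffic, while each collision update is computed in FP64.

```python
from array import array

q = 19                      # D3Q19: 19 DDFs per lattice cell
f = array('f', [0.05] * q)  # FP32 storage of one cell's DDFs (4 B each)
tau = 0.6                   # BGK relaxation time (hypothetical value)

def collide(f_cell, feq_cell, tau):
    # Load FP32 values, relax toward equilibrium in FP64 (Python floats),
    # then let the caller store the result back as FP32.
    return [fi + (feqi - fi) / tau for fi, feqi in zip(f_cell, feq_cell)]

feq = [0.055] * q           # placeholder equilibrium values
f_new = array('f', collide(list(f), feq, tau))  # cast back to FP32 on store

print(len(f_new))  # 19
```

The cast on store is the only place precision is lost; all intermediate arithmetic keeps FP64 accuracy, which is the mixed-precision variant the quote describes.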
“…Thanks to the simple nature of individual stencil codes and their ease of parallelization, the performance of stencil code accelerators is typically not bound by computational capacity, but by the speed at which grid data can be accessed [14,27,43]. As a result, a large amount of work has focused on memory access and re-use methodologies, aiming to improve the ratio between the amount of memory access and computation.…”
Section: Stencil Computing and Its Acceleration
confidence: 99%
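The memory-bound claim can be checked with a back-of-the-envelope roofline estimate (all figures below are illustrative assumptions, not numbers from the citing papers): a D3Q19 step streams 19 DDFs in and 19 out per cell, so in FP32 the arithmetic intensity sits far below the machine balance of a typical GPU.

```python
# Per-cell memory traffic for one D3Q19 time step in FP32:
bytes_per_cell = 19 * 2 * 4   # 19 loads + 19 stores, 4 B each = 152 B

# Rough FLOP count for one collision step (assumed figure):
flops_per_cell = 300

arithmetic_intensity = flops_per_cell / bytes_per_cell  # ~2 FLOP/byte

# Hypothetical data-center GPU: ~20 TFLOPs/s FP32 vs ~1 TB/s bandwidth,
# i.e. ~20 FLOP/byte are needed before compute becomes the bottleneck.
machine_balance = 20e12 / 1e12

print(arithmetic_intensity < machine_balance)  # True: memory-bound
```

Because the intensity falls an order of magnitude short of the balance point, reducing bytes moved (lower-precision storage, better re-use) pays off directly, which is exactly the optimization direction the quoted passage describes.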
“…Nvidia GPUs, CPUs and Nvidia GPUs [16][17][18][19][20][21][22][23][24][25][26] or mobile SoCs [116,117]; only a few use OpenCL [5][6][7][8][9][10][11][12][13][14][15]. With FluidX3D also being implemented in OpenCL, we are able to benchmark our code across a large variety of hardware, from the world's fastest data-center GPUs over gaming GPUs and CPUs to even the GPUs of mobile-phone ARM SoCs.…”
Section: Memory and Performance Comparison
confidence: 99%
“…8.9 LBM core of the FluidX3D OpenCL C implementation (D3Q19 SRT FP32/xx):

feq[ 1] = fma(rhos, fma(0.5f, fma(ux, ux, c3),  ux), rhom1s);
feq[ 2] = fma(rhos, fma(0.5f, fma(ux, ux, c3), -ux), rhom1s); // +00 -00
feq[ 3] = fma(rhos, fma(0.5f, fma(uy, uy, c3),  uy), rhom1s);
feq[ 4] = fma(rhos, fma(0.5f, fma(uy, uy, c3), -uy), rhom1s); // 0+0 0-0
feq[ 5] = fma(rhos, fma(0.5f, fma(uz, uz, c3),  uz), rhom1s);
feq[ 6] = fma(rhos, fma(0.5f, fma(uz, uz, c3), -uz), rhom1s); // 00+ 00-
feq[ 7] = fma(rhoe, fma(0.5f, fma(u0, u0, c3),  u0), rhom1e);
feq[ 8] = fma(rhoe, fma(0.5f, fma(u0, u0, c3), -u0), rhom1e); // ++0 --0
feq[ 9] = fma(rhoe, fma(0.5f, fma(u1, u1, c3),  u1), rhom1e);
feq[10] = fma(rhoe, fma(0.5f, fma(u1, u1, c3), -u1), rhom1e); // +0+ -0-
feq[11] = fma(rhoe, fma(0.5f, fma(u2, u2, c3),  u2), rhom1e);
feq[12] = fma(rhoe, fma(0.5f, fma(u2, u2, c3), -u2), rhom1e); // 0++ 0--
feq[13] = fma(rhoe, fma(0.5f, fma(u3, u3, c3),  u3), rhom1e);
feq[14] = fma(rhoe, fma(0.5f, fma(u3, u3, c3), -u3), rhom1e); // +-0 -+0
feq[15] = fma(rhoe, fma(0.5f, fma(u4, u4, c3),  u4), rhom1e);
feq[16] = fma(rhoe, fma(0.5f, fma(u4, u4, c3), -u4), rhom1e); // +0- -0+
feq[17] = fma(rhoe, fma(0.5f, fma(u5, u5, c3),  u5), rhom1e);
feq[18] = fma(rhoe, fma(0.5f, fma(u5, u5, c3), -u5), rhom1e); // 0+- 0-+
…”
Section: LBM Equations in a Nutshell
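The listing factors each equilibrium term into nested fused multiply-adds. A quick sanity check (with `fma` modeled algebraically as a*b + c; the identifiers `rhos`, `c3` and the `rhom1` offset are taken from the listing, but their exact physical meaning and the sample values below are assumptions) confirms the nesting encodes the polynomial rhos·(0.5·u² + 0.5·c3 + u) + rhom1:

```python
import math

def fma(a, b, c):
    # Algebraic model of the hardware fused multiply-add; a real FMA
    # additionally rounds only once, which slightly improves accuracy.
    return a * b + c

def feq_nested(rhos, u, c3, rhom1):
    # Nested-fma form as in the FluidX3D listing.
    return fma(rhos, fma(0.5, fma(u, u, c3), u), rhom1)

def feq_expanded(rhos, u, c3, rhom1):
    # The polynomial the nesting encodes.
    return rhos * (0.5 * u * u + 0.5 * c3 + u) + rhom1

rhos, u, c3, rhom1 = 0.12, 0.03, -0.002, 0.11
print(math.isclose(feq_nested(rhos, u, c3, rhom1),
                   feq_expanded(rhos, u, c3, rhom1), rel_tol=1e-12))  # True
```

Flipping the sign of u gives the opposite lattice direction, which is why the listing computes each +/- direction pair from the same sub-expressions, trading redundant multiplies for a handful of FMA instructions.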