Locality optimized unstructured mesh algorithms on GPUs

Sulyok, Andras Attila; Balogh, Gábor Dániel; Reguly, István Z.; Mudalige, Gihan R.

doi:10.1016/j.jpdc.2019.07.011

Cited by 12 publications

(10 citation statements)

References 30 publications


“…Applying this strategy to CUDA, means that thread blocks need to be coloured and no two thread blocks will write to the same data. However for CUDA, previous work [43] has shown that a further level of colouring gives better performance. In this case the threads within a thread block is also coloured to avoid data races.…”

Section: A Cuda (mentioning)

confidence: 98%

OP2-Clang: A Source-to-Source Translator Using Clang/LLVM LibTooling

Balogh, Mudalige, Reguly et al. 2018

2018 IEEE/ACM 5th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC)


“…Coloring can be similarly used for parallelizing on GPUs. Given the larger number of threads executable on GPUs, and the availability of GPU shared memory, different variations of coloring can be used [21]. For distributed memory parallelizations, such as using MPI, explicitly partitioning the mesh and assigning them to different processors leads to a decomposition of work that only have the potential to overlap at the boundaries of the partitions.…”

Section: Parallelizing Unstructured-mesh Applications (mentioning)

confidence: 99%
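
The colouring idea mentioned in the statement above can be sketched in a few lines. This is a minimal serial illustration, not the scheme used by OP2 or by the cited papers: the function names `colour_edges` and `edge_loop_by_colour` and the simple "+w / -w" update are assumptions made for the example. Edges that share a node receive different colours, so all edges of one colour could be processed in parallel without two of them writing to the same node.

```python
def colour_edges(edges):
    """Greedy colouring: edges that share a node get different colours.

    `edges` is a list of (node_a, node_b) pairs (hypothetical layout).
    """
    node_colours = {}  # node -> set of colours already used by its edges
    colours = []
    for a, b in edges:
        used = node_colours.setdefault(a, set()) | node_colours.setdefault(b, set())
        c = 0
        while c in used:  # pick the smallest colour unused at both end nodes
            c += 1
        colours.append(c)
        node_colours[a].add(c)
        node_colours[b].add(c)
    return colours


def edge_loop_by_colour(edges, colours, edge_weights, num_nodes):
    """Apply indirect increments one colour at a time.

    Within one colour no two edges touch the same node, so the inner loop
    is race-free and could be run by parallel threads; here it is serial.
    """
    node_data = [0.0] * num_nodes
    num_colours = max(colours) + 1 if colours else 0
    for colour in range(num_colours):
        for (a, b), c, w in zip(edges, colours, edge_weights):
            if c == colour:
                node_data[a] += w  # illustrative update, assumed for the sketch
                node_data[b] -= w
    return node_data
```

On a GPU the same colouring could be applied to whole thread blocks, to threads within a block, or to both, which is the kind of variation the statements refer to.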

“…An owner compute model with redundant computation can be used in this case to handle data races [14]. Other strategies published for parallelizing unstructured-mesh applications have included the use of a large temporary array [1] and atomics [21]. Using a large temporary array entails storing the indirect increments for the nodes in a staging array, during the edge loop, for example, and then a separate iteration over the nodes to apply the increments from the temporary array on to the nodal data.…”

Section: Parallelizing Unstructured-mesh Applications (mentioning)

confidence: 99%
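
The large-temporary-array strategy described in the statement above can be sketched as follows. This is a serial illustration under stated assumptions: the function name `edge_loop_with_staging`, the per-edge flux input, and the "+flux / -flux" increments are hypothetical and not taken from the cited work. The edge loop writes only to its own slots of the staging array, and a separate node loop gathers those slots, so neither loop has two iterations writing to the same location.

```python
def edge_loop_with_staging(edges, edge_flux, num_nodes):
    """Two-pass indirect increment using a staging array.

    `edges` is a list of (node_a, node_b) pairs and `edge_flux` holds one
    value per edge (both hypothetical inputs for this sketch).
    """
    # Pass 1 (edge loop): each edge stores its two increments in its own
    # staging slots, so iterations never write to a shared location.
    staging = [(0.0, 0.0)] * len(edges)
    for e in range(len(edges)):
        staging[e] = (edge_flux[e], -edge_flux[e])

    # Reverse map, built once per mesh: node -> list of (edge, slot).
    incident = [[] for _ in range(num_nodes)]
    for e, (a, b) in enumerate(edges):
        incident[a].append((e, 0))
        incident[b].append((e, 1))

    # Pass 2 (node loop): each node sums only its own increments.
    node_data = [0.0] * num_nodes
    for n in range(num_nodes):
        for e, slot in incident[n]:
            node_data[n] += staging[e][slot]
    return node_data
```

The cost of this approach is the extra memory and traffic for the staging array, which is why it is usually weighed against colouring and atomics.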

“…This parallelization strategy can be generally implemented on any shared memory multi-threaded system, including CPUs and GPUs without any restrictions due to hardware capabilities. Different variations of coloring have been implemented within OP2 as detailed in previous works [21]. Figure 4 details an excerpt of the SYCL code generated by OP2 for the most time-consuming parallel loop compute_flux_edge _kernel in MG-CFD.…”

Section: Coloring (mentioning)

confidence: 99%


Under the Hood of SYCL – An Initial Performance Analysis with An Unstructured-Mesh CFD Application

Reguly, Owenson, Powell et al. 2021

Lecture Notes in Computer Science


“…However, for complex geometry, the effects of reordering are weak [23]. Some researchers studied the influence of SOA and AOS data layout on data locality [24]. However, indirect memory access still exists in different data layouts.…”

Section: Introduction (mentioning)

confidence: 99%

Effects of mesh loop modes on performance of unstructured finite volume GPU simulations

et al. 2021


In unstructured finite volume method, loop on different mesh components such as cells, faces, nodes, etc is used widely for the traversal of data. Mesh loop results in direct or indirect data access that affects data locality significantly. By loop on mesh, many threads accessing the same data lead to data dependence. Both data locality and data dependence play an important part in the performance of GPU simulations. For optimizing a GPU-accelerated unstructured finite volume Computational Fluid Dynamics (CFD) program, the performance of hot spots under different loops on cells, faces, and nodes is evaluated on Nvidia Tesla V100 and K80. Numerical tests under different mesh scales show that the effects of mesh loop modes are different on data locality and data dependence. Specifically, face loop makes the best data locality, so long as access to face data exists in kernels. Cell loop brings the smallest overheads due to non-coalescing data access, when both cell and node data are used in computing without face data. Cell loop owns the best performance in the condition that only indirect access of cell data exists in kernels. Atomic operations reduced the performance of kernels largely in K80, which is not obvious on V100. With the suitable mesh loop mode in all kernels, the overall performance of GPU simulations can be increased by 15%-20%. Finally, the program on a single GPU V100 can achieve maximum 21.7 and average 14.1 speed up compared with 28 MPI tasks on two Intel CPUs Xeon Gold 6132.
