2019
DOI: 10.1016/j.jpdc.2019.07.011

Locality optimized unstructured mesh algorithms on GPUs

Abstract: Unstructured-mesh based numerical algorithms such as finite volume and finite element algorithms form an important class of applications for many scientific and engineering domains. The key difficulty in achieving higher performance from these applications is the indirect accesses that lead to data-races when parallelized. Current methods for handling such data-races lead to reduced parallelism and suboptimal performance. Particularly on modern many-core architectures, such as GPUs, that has increasing core/th…

Cited by 12 publications (10 citation statements)
References: 30 publications

“…Applying this strategy to CUDA, means that thread blocks need to be coloured and no two thread blocks will write to the same data. However for CUDA, previous work [43] has shown that a further level of colouring gives better performance. In this case the threads within a thread block is also coloured to avoid data races.…”
Section: A Cuda (mentioning)
confidence: 98%
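
The two-level colouring described in the statement above can be illustrated with a small sketch. It is not the cited paper's implementation: it assumes a simple greedy colouring, models a "thread block" as a fixed-size chunk of an edge list, and the names `greedy_colour` and `two_level_colouring` are hypothetical. Blocks with the same colour touch no common node, and edges with the same colour inside one block touch no common node.

```python
from collections import defaultdict


def greedy_colour(items, touched):
    """Give each item the smallest colour not yet used by any earlier
    item that writes to one of the same targets (greedy heuristic)."""
    used = defaultdict(set)  # target -> colours already writing to it
    colours = []
    for item in items:
        targets = touched(item)
        taken = set().union(*(used[t] for t in targets))
        colour = 0
        while colour in taken:
            colour += 1
        colours.append(colour)
        for t in targets:
            used[t].add(colour)
    return colours


def two_level_colouring(edges, block_size):
    """Hypothetical helper: colour blocks of edges, then the edges
    inside each block. Each edge is a pair of node indices."""
    # Assumption: a block is just a contiguous chunk of the edge list.
    blocks = [edges[i:i + block_size]
              for i in range(0, len(edges), block_size)]
    # Level 1: no two blocks of the same colour share a node.
    block_colours = greedy_colour(
        blocks, lambda block: {n for edge in block for n in edge})
    # Level 2: within a block, no two edges of the same colour share a node.
    edge_colours = [greedy_colour(block, lambda edge: set(edge))
                    for block in blocks]
    return blocks, block_colours, edge_colours


# A path mesh 0-1-2-3-4 split into blocks of two edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
blocks, block_colours, edge_colours = two_level_colouring(edges, 2)
print(block_colours)  # [0, 1]
print(edge_colours)   # [[0, 1], [0, 1]]
```

On a GPU, blocks of one colour would run concurrently, and within a block the threads would apply their increments one edge colour at a time with a barrier between colours.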
“…Coloring can be similarly used for parallelizing on GPUs. Given the larger number of threads executable on GPUs, and the availability of GPU shared memory, different variations of coloring can be used [21]. For distributed memory parallelizations, such as using MPI, explicitly partitioning the mesh and assigning them to different processors leads to a decomposition of work that only have the potential to overlap at the boundaries of the partitions.…”
Section: Parallelizing Unstructured-mesh Applications (mentioning)
confidence: 99%
“…An owner compute model with redundant computation can be used in this case to handle data races [14]. Other strategies published for parallelizing unstructured-mesh applications have included the use of a large temporary array [1] and atomics [21]. Using a large temporary array entails storing the indirect increments for the nodes in a staging array, during the edge loop, for example, and then a separate iteration over the nodes to apply the increments from the temporary array on to the nodal data.…”
Section: Parallelizing Unstructured-mesh Applications (mentioning)
confidence: 99%
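
The temporary-array strategy in the statement above can be sketched in a few lines. This is a minimal serial illustration under stated assumptions, not code from any cited work: the function name `edge_loop_with_staging` is hypothetical, and each edge is assumed to add its flux to its first node and subtract it from its second. The point is that the edge loop writes only to slots owned by that edge, so it has no data race, and the indirect increments are applied in a separate loop over nodes.

```python
def edge_loop_with_staging(edges, edge_flux, num_nodes):
    """Apply per-edge increments to nodes via a staging array.

    edges     -- list of (node_a, node_b) index pairs
    edge_flux -- one value per edge (assumed: +flux to a, -flux to b)
    """
    # Phase 1 (edge loop): each edge writes only its own two slots,
    # so iterations are independent and could run in parallel.
    staging = [0.0] * (2 * len(edges))
    for e in range(len(edges)):
        staging[2 * e] = edge_flux[e]       # increment for node_a
        staging[2 * e + 1] = -edge_flux[e]  # increment for node_b

    # Reverse map: which staging slots belong to each node.
    # For a static mesh this would be built once and reused.
    slots_of_node = [[] for _ in range(num_nodes)]
    for e, (a, b) in enumerate(edges):
        slots_of_node[a].append(2 * e)
        slots_of_node[b].append(2 * e + 1)

    # Phase 2 (node loop): each node gathers its own increments,
    # so again no two iterations write to the same location.
    node_data = [0.0] * num_nodes
    for n in range(num_nodes):
        for slot in slots_of_node[n]:
            node_data[n] += staging[slot]
    return node_data


print(edge_loop_with_staging([(0, 1), (1, 2)], [1.0, 2.0], 3))
# [1.0, 1.0, -2.0]
```

The cost of this approach is the extra memory for the staging array and the second pass over the data, which is why it is described as a "large" temporary array.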
“…However, for complex geometry, the effects of reordering are weak [23]. Some researchers studied the influence of SOA and AOS data layout on data locality [24]. However, indirect memory access still exists in different data layouts.…”
Section: Introduction (mentioning)
confidence: 99%