2019
DOI: 10.1177/1094342018816368

Acceleration of tensor-product operations for high-order finite element methods

Abstract: This paper is devoted to GPU kernel optimization and performance analysis of three tensor-product operators arising in finite element methods. We provide a mathematical background to these operations and implementation details. Achieving close-to-peak performance for these operators requires extensive optimization because of the operators' properties: low arithmetic intensity, tiered structure, and the need to store intermediate results inside the kernel. We give a guided overview of optimization strategies…

Cited by 56 publications (49 citation statements)
References 29 publications (47 reference statements)
“…On Summit, we found that highly tuned OCCA kernels could deliver 2 TFLOPS, which is a significant fraction of the peak 7 TFLOPS cited for a single V100 GPU. Moreover, the OCCA performance is near the bounds established by the bandwidth-limited roofline performance model for BK5 (and, for other kernels, as shown by Świrydowicz et al., 2019). For BP5, Summit realizes in excess of 10,000 MDOFs per node, a factor of 125 larger than a single node of Cetus.…”
Section: Discussion (supporting)
confidence: 57%
“…Threads within a block are assigned in a 2-D layout, with an i–j column of the i–j–k spectral element data assigned to each thread. The individual kernel formulations, K = 1 to 10, constitute successive improvements in memory management that are described briefly in Appendix 1 and in detail in Świrydowicz et al. (2019). Kernel 1 sustains 500 GFLOPS for p = 8–15, while Kernels 9 and 10 reach a peak of 2 TFLOPS for the highest polynomial orders.…”
Section: Bake-off Performance on Summit (mentioning)
confidence: 99%
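The 2-D thread layout described in this excerpt is the common pattern for these tensor-product kernels. The CUDA sketch below is illustrative only, not one of the paper's ten kernel formulations: the kernel name, data layout, and the fixed Nq = p + 1 constant are assumptions made for the example, and the actual contraction is replaced by a placeholder.

```cuda
// Minimal sketch (not the paper's tuned kernels): one thread block per
// spectral element, threads laid out 2-D over (i, j); each thread walks
// the k-direction of the Nq x Nq x Nq element data.
#define Nq 8  // points per direction, Nq = p + 1 (illustrative choice)

__global__ void sketch_kernel(const double* __restrict__ u,
                              double* __restrict__ v,
                              int n_elements) {
  const int e = blockIdx.x;   // one element per block
  const int i = threadIdx.x;  // 2-D thread layout over (i, j)
  const int j = threadIdx.y;
  if (e >= n_elements) return;

  // Each thread owns the k-column at (i, j) and keeps it in registers --
  // the "intermediate results stored inside the kernel" that the
  // abstract identifies as a key property of these operators.
  double r[Nq];
  for (int k = 0; k < Nq; ++k) {
    const int idx = ((e * Nq + k) * Nq + j) * Nq + i;
    r[k] = u[idx];            // load the column once
  }
  for (int k = 0; k < Nq; ++k) {
    const int idx = ((e * Nq + k) * Nq + j) * Nq + i;
    v[idx] = 2.0 * r[k];      // placeholder for the real tensor contraction
  }
}
// Launch: sketch_kernel<<<n_elements, dim3(Nq, Nq)>>>(u, v, n_elements);
```

With this layout, a block holds Nq × Nq threads and the k-loop runs in registers, which is what the successive kernel formulations progressively optimize.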
“…To put the performance of the fully merged case on Intel Skylake into perspective, we compare with executing the plain CG method on an Nvidia V100 GPU using the implementation from [51,57]: even though the GPU runs with around 700 GB/s of memory throughput, the performance is higher on Intel Skylake with only 200 GB/s from RAM because the merged loops significantly increase data locality. Furthermore, on the GPU we do not compute the metric terms on the fly, but load a precomputed tensor $J_K^{-1} J_K^{-T} \det(J_K) w_q$, which is faster due to reduced register pressure; see also the analysis for BP5 in [71]. We also note that the GPU results with our implementation are faster than an implementation with the OCCA library described in [29], with up to 0.6 billion DoFs/s on a V100 of the Summit supercomputer.…”
Section: Performance-optimized Conjugate Gradient Methods (mentioning)
confidence: 99%
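The precomputed tensor in this excerpt collapses the inverse Jacobian, its transpose, the Jacobian determinant, and the quadrature weight into one symmetric 3×3 geometric factor per quadrature point. The CUDA sketch below illustrates the trade it describes, plain loads of six stored entries instead of on-the-fly recomputation of metric terms; the array layout and kernel name are hypothetical, not the implementation of the cited references.

```cuda
// Sketch of the precomputed-geometric-factor variant (illustrative layout).
// The six independent entries of the symmetric tensor
// G = J_K^{-1} J_K^{-T} det(J_K) w_q are stored per quadrature point, so
// the kernel issues plain loads instead of recomputing the metric terms,
// trading memory traffic for lower register pressure.
__global__ void apply_geometric_factors(
    const double* __restrict__ G,       // 6 entries per quadrature point
    const double* __restrict__ grad_u,  // 3 gradient components per point
    double* __restrict__ out,           // 3 result components per point
    int n_points) {
  const int q = blockIdx.x * blockDim.x + threadIdx.x;
  if (q >= n_points) return;

  const double G00 = G[6 * q + 0], G01 = G[6 * q + 1], G02 = G[6 * q + 2];
  const double G11 = G[6 * q + 3], G12 = G[6 * q + 4], G22 = G[6 * q + 5];
  const double ux = grad_u[3 * q + 0];
  const double uy = grad_u[3 * q + 1];
  const double uz = grad_u[3 * q + 2];

  // Symmetric matrix-vector product G * grad(u) at each quadrature point.
  out[3 * q + 0] = G00 * ux + G01 * uy + G02 * uz;
  out[3 * q + 1] = G01 * ux + G11 * uy + G12 * uz;
  out[3 * q + 2] = G02 * ux + G12 * uy + G22 * uz;
}
```

Storing six doubles per point costs bandwidth but frees the registers that on-the-fly metric evaluation would occupy, which is the register-pressure argument made in the excerpt.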
“…We also show that the majority of the computational cost during the elliptic solves is contained in the action of the elliptic operators, and we detail the GPU performance of these operators. In order to assess the performance of our computational kernels, we use an empirical roofline model (Volkov and Demmel, 2008; Świrydowicz et al., 2017). The model relies on the observation that the GPU is typically a memory-bound device: the runtime of a kernel cannot be faster than the time needed to transfer the data used in the kernel.…”
(mentioning)
confidence: 99%
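The empirical roofline bound quoted here reduces to simple arithmetic: a kernel that moves B bytes on a device with measured bandwidth W cannot finish faster than B / W seconds, which caps its attainable floating-point throughput. A minimal host-side sketch follows; the numbers are illustrative assumptions (the ~790 GB/s figure is an assumed empirically measured V100-class streaming bandwidth, not a value from the excerpt).

```cuda
// Host-side sketch of the empirical roofline bound: runtime is bounded
// below by (bytes moved) / (measured bandwidth). In the empirical model
// the bandwidth is measured with a copy kernel on the target GPU rather
// than taken from the spec sheet.
#include <cstdio>

double roofline_gflops_bound(double flops, double bytes, double bw_gb_per_s) {
  const double t_min = bytes / (bw_gb_per_s * 1e9);  // seconds, memory-bound floor
  return flops / t_min / 1e9;                        // GFLOPS upper bound
}

int main() {
  // Illustrative numbers only: 0.5 GFLOP of work, 1 GB of data traffic,
  // ~790 GB/s assumed empirical streaming bandwidth.
  const double bound = roofline_gflops_bound(0.5e9, 1.0e9, 790.0);
  std::printf("memory-bound performance ceiling: %.1f GFLOPS\n", bound);
  return 0;
}
```

A kernel whose measured GFLOPS sits near this ceiling is bandwidth-limited, which is the sense in which the excerpts above describe the OCCA kernels as "near the bounds" of the roofline model.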