2019
DOI: 10.1177/1094342018816368

Acceleration of tensor-product operations for high-order finite element methods

Abstract: This paper is devoted to GPU kernel optimization and performance analysis of three tensor-product operators arising in finite element methods. We provide a mathematical background to these operations and implementation details. Achieving close-to-peak performance for these operators requires extensive optimization because of the operators' properties: low arithmetic intensity, tiered structure, and the need to store intermediate results inside the kernel. We give a guided overview of optimization strategies…

Cited by 56 publications (49 citation statements)
References 29 publications (47 reference statements)
“…On Summit, we found that highly tuned OCCA kernels could deliver 2 TFLOPS, which is a significant fraction of the peak 7 TFLOPS cited for a single V100 GPU. Moreover, the OCCA performance is near the bounds established by the bandwidth-limited roofline performance model for BK5 (and, for other kernels, as shown by Świrydowicz et al., 2019). For BP5, Summit realizes in excess of 10,000 MDOFs per node, a factor of 125 larger than a single node of Cetus.…”
Section: Discussion (supporting)
confidence: 57%
“…Threads within a block are assigned in a 2-D layout, with an i–j column of the i–j–k spectral element data assigned to each thread. The individual kernel formulations, K = 1 to 10, constitute successive improvements in memory management that are described briefly in Appendix 1 and in detail in Świrydowicz et al. (2019). Kernel 1 sustains 500 GFLOPS for p = 8–15, while Kernels 9 and 10 reach a peak of 2 TFLOPS for the highest polynomial orders.…”
Section: Bake-off Performance on Summit (mentioning)
confidence: 99%
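The 2-D thread layout described in this excerpt is the common pattern for these tensor-product kernels. The CUDA sketch below is illustrative only, not one of the paper's ten kernel formulations: the kernel name, data layout, and the fixed Nq = p + 1 constant are assumptions made for the example, and the actual contraction is replaced by a placeholder.

```cuda
// Minimal sketch (not the paper's tuned kernels): one thread block per
// spectral element, threads laid out 2-D over (i, j); each thread walks
// the k-direction of the Nq x Nq x Nq element data.
#define Nq 8  // points per direction, Nq = p + 1 (illustrative choice)

__global__ void sketch_kernel(const double* __restrict__ u,
                              double* __restrict__ v,
                              int n_elements) {
  const int e = blockIdx.x;   // one element per block
  const int i = threadIdx.x;  // 2-D thread layout over (i, j)
  const int j = threadIdx.y;
  if (e >= n_elements) return;

  // Each thread owns the k-column at (i, j) and keeps it in registers --
  // the "intermediate results stored inside the kernel" that the
  // abstract identifies as a key property of these operators.
  double r[Nq];
  for (int k = 0; k < Nq; ++k) {
    const int idx = ((e * Nq + k) * Nq + j) * Nq + i;
    r[k] = u[idx];            // load the column once
  }
  for (int k = 0; k < Nq; ++k) {
    const int idx = ((e * Nq + k) * Nq + j) * Nq + i;
    v[idx] = 2.0 * r[k];      // placeholder for the real tensor contraction
  }
}
// Launch: sketch_kernel<<<n_elements, dim3(Nq, Nq)>>>(u, v, n_elements);
```

With this layout, a block holds Nq × Nq threads and the k-loop runs in registers, which is what the successive kernel formulations progressively optimize.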
“…To put the performance of the fully merged case on Intel Skylake into perspective, we compare with executing the plain CG method on an Nvidia V100 GPU using the implementation from [51,57]: even though the GPU runs with around 700 GB/s of memory throughput, the performance is higher on Intel Skylake with only 200 GB/s from RAM because the merged loops significantly increase data locality. Furthermore, on the GPU we do not compute the metric terms on the fly, but load a precomputed tensor $J_K^{-1} J_K^{-T} \det(J_K) w_q$, which is faster due to reduced register pressure; see also the analysis for BP5 in [71]. We also note that the GPU results with our implementation are faster than an implementation with the OCCA library described in [29], with up to 0.6 billion DoFs/s on a V100 of the Summit supercomputer.…”
Section: Performance-optimized Conjugate Gradient Methods (mentioning)
confidence: 99%
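The precomputed tensor in this excerpt collapses the inverse Jacobian, its transpose, the Jacobian determinant, and the quadrature weight into one symmetric 3×3 geometric factor per quadrature point. The CUDA sketch below illustrates the trade it describes, plain loads of six stored entries instead of on-the-fly recomputation of metric terms; the array layout and kernel name are hypothetical, not the implementation of the cited references.

```cuda
// Sketch of the precomputed-geometric-factor variant (illustrative layout).
// The six independent entries of the symmetric tensor
// G = J_K^{-1} J_K^{-T} det(J_K) w_q are stored per quadrature point, so
// the kernel issues plain loads instead of recomputing the metric terms,
// trading memory traffic for lower register pressure.
__global__ void apply_geometric_factors(
    const double* __restrict__ G,       // 6 entries per quadrature point
    const double* __restrict__ grad_u,  // 3 gradient components per point
    double* __restrict__ out,           // 3 result components per point
    int n_points) {
  const int q = blockIdx.x * blockDim.x + threadIdx.x;
  if (q >= n_points) return;

  const double G00 = G[6 * q + 0], G01 = G[6 * q + 1], G02 = G[6 * q + 2];
  const double G11 = G[6 * q + 3], G12 = G[6 * q + 4], G22 = G[6 * q + 5];
  const double ux = grad_u[3 * q + 0];
  const double uy = grad_u[3 * q + 1];
  const double uz = grad_u[3 * q + 2];

  // Symmetric matrix-vector product G * grad(u) at each quadrature point.
  out[3 * q + 0] = G00 * ux + G01 * uy + G02 * uz;
  out[3 * q + 1] = G01 * ux + G11 * uy + G12 * uz;
  out[3 * q + 2] = G02 * ux + G12 * uy + G22 * uz;
}
```

Storing six doubles per point costs bandwidth but frees the registers that on-the-fly metric evaluation would occupy, which is the register-pressure argument made in the excerpt.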
“…We also show that the majority of the computational cost during the elliptic solves is contained in the action of the elliptic operators, and we detail the GPU performance of these operators. In order to assess the performance of our computational kernels, we use an empirical roofline model (Volkov and Demmel, 2008; Świrydowicz et al., 2017). The model relies on the observation that the GPU is typically a memory-bound device: the runtime of a kernel cannot be faster than the time needed to transfer the data used in the kernel.…”
(mentioning)
confidence: 99%
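The empirical roofline bound quoted here reduces to simple arithmetic: a kernel that moves B bytes on a device with measured bandwidth W cannot finish faster than B / W seconds, which caps its attainable floating-point throughput. A minimal host-side sketch follows; the numbers are illustrative assumptions (the ~790 GB/s figure is an assumed empirically measured V100-class streaming bandwidth, not a value from the excerpt).

```cuda
// Host-side sketch of the empirical roofline bound: runtime is bounded
// below by (bytes moved) / (measured bandwidth). In the empirical model
// the bandwidth is measured with a copy kernel on the target GPU rather
// than taken from the spec sheet.
#include <cstdio>

double roofline_gflops_bound(double flops, double bytes, double bw_gb_per_s) {
  const double t_min = bytes / (bw_gb_per_s * 1e9);  // seconds, memory-bound floor
  return flops / t_min / 1e9;                        // GFLOPS upper bound
}

int main() {
  // Illustrative numbers only: 0.5 GFLOP of work, 1 GB of data traffic,
  // ~790 GB/s assumed empirical streaming bandwidth.
  const double bound = roofline_gflops_bound(0.5e9, 1.0e9, 790.0);
  std::printf("memory-bound performance ceiling: %.1f GFLOPS\n", bound);
  return 0;
}
```

A kernel whose measured GFLOPS sits near this ceiling is bandwidth-limited, which is the sense in which the excerpts above describe the OCCA kernels as "near the bounds" of the roofline model.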