OpenACC-based GPU Acceleration of a 3-D Unstructured Discontinuous Galerkin Method

Xia, Yidong; Luo, Hong; Luo, Lixiang; Edwards, Jack R.; Lou, Jialin

doi:10.2514/6.2014-1129

Cited by 11 publications

(11 citation statements)

References 26 publications

“…With the assistance of convenient development tools many software engineering issues of porting CFD codes can be trivially resolved, including data management and efficient data exchange between multiple GPGPUs. Our previous attempts of porting CFD codes to GPGPU using OpenACC and GPGPU-aware MPI implementations have proven the advantages on portability and maintainability of this approach [2,7]. More fundamental performance restrictions are encountered when porting implicit CFD solvers.…”

Section: Introduction (mentioning)

confidence: 97%

Optimization of A Fine-grained BILU by CUDA Inter-block Synchronization

Luo, Edwards, Luo, et al. 2015

22nd AIAA Computational Fluid Dynamics Conference

Self Cite

A fine-grained block incomplete LU (FGBILU) factorization for solving large-scale block-sparse linear systems resulting from coupled PDE systems with $n$ equations has been recently developed for massively parallel heterogeneous architectures, such as general-purpose graphics processing units (GPGPUs). A straightforward one-sweep wavefront ordering is combined with element-wise block submatrix operations, allowing FGBILU to achieve low-overhead concurrent computation at $O(n^2 N^2)$ scale on a 3D PDE domain with a linear scale of $N$. Numerical experiments show that FGBILU is less efficient on smaller domains. Besides the inevitable performance penalty of a wavefront ordering, the index reconstruction by each concurrent computation thread causes considerable parallelism overhead. One way to reduce the overhead is to employ thread recycling along with CUDA inter-block synchronization. Dynamic parallelism is also attempted, although with no significant performance benefit. The improved FGBILU is tested for a series of 3D PDE domains extracted from an incompressible Navier-Stokes solver called INCOMP3D. Results show that thread recycling can significantly reduce parallelism overhead and improve the performance of FGBILU on smaller domains.
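The wavefront ordering mentioned in this abstract can be illustrated with a small counting sketch. It assumes a structured N x N x N grid in which cell (i, j, k) depends only on its lower neighbors (i-1, j, k), (i, j-1, k), and (i, j, k-1), so that all cells on a plane i + j + k = w are mutually independent. The abstract does not spell out this dependency pattern, and the function names below are hypothetical; this is not the FGBILU implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Number of cells (i, j, k) in an N x N x N grid with i + j + k == w.
   Under the assumed lower-neighbor dependency pattern, these cells
   could be processed concurrently. */
size_t wavefront_size(int N, int w)
{
    assert(N > 0);
    size_t count = 0;
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            int k = w - i - j;          /* k is fixed once i and j are chosen */
            if (k >= 0 && k < N) {
                ++count;
            }
        }
    }
    return count;
}

/* Number of wavefronts in one sweep: w runs from 0 to 3 * (N - 1). */
int wavefront_count(int N)
{
    assert(N > 0);
    return 3 * (N - 1) + 1;
}

/* Largest wavefront in the sweep, i.e. the peak available concurrency. */
size_t max_wavefront_size(int N)
{
    size_t best = 0;
    for (int w = 0; w < wavefront_count(N); ++w) {
        size_t s = wavefront_size(N, w);
        if (s > best) {
            best = s;
        }
    }
    return best;
}
```

For N = 3 the sweep has 7 wavefronts of sizes 1, 3, 6, 7, 6, 3, 1. The first and last wavefronts contain a single cell, which is one way to read the "inevitable performance penalty of a wavefront ordering", while the peak wavefront grows on the order of N^2, in line with the N^2 factor quoted in the abstract.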

“…Recent development on hardware and software technologies of General Purpose Graphics Processing Unit (GPGPU) has greatly improved its potential for large-scale high performance computation (see, for example [1,2,3,4,5]). Maturing programming frameworks have allowed GPGPU algorithm designs to gain more popularity.…”

Section: Introduction (mentioning)

confidence: 99%

Optimization of A Fine-grained BILU by CUDA Inter-block Synchronization

Luo, Edwards, Luo, et al. 2015

22nd AIAA Computational Fluid Dynamics Conference

Self Cite

“…Recently, general-purpose graphics processing units (GPGPU) have attracted much attention as a promising technology for large-scale high performance computation (see, for example, [10][11][12][13][14][15][16]). A GPU, which is essentially a shared memory vector machine, has a potential to achieve one or two orders of magnitude of performance improvement for highly parallel algorithms.…”

Section: Introduction (mentioning)

confidence: 99%

A fine-grained block ILU scheme on regular structures for GPGPUs

et al. 2015

Self Cite

“…Many successful attempts have been reported in recent years (see, for example [1,2,3,4,5,6]). Although early attempts of utilizing GPGPU for CFD has been hampered by the heterogeneous nature of GPGPU hardware and complex programming tools, recent GPU technology has seen significant improvement on programming toolchain.…”

Section: Introduction (mentioning)

confidence: 99%

“…By allowing programmers to use a collection of compiler directives to specify loops and regions of their codes to be offloaded from a host CPU to GPGPU, this programming model offers a good balance between porting efforts and performance gain. As shown in [8,6], our previous attempts of porting existing CFD codes to GPGPU using OpenACC and MVAPICH2 [9] have proven the advantages on portability and maintainability of this approach.…”

Section: Introduction (mentioning)

confidence: 99%
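
The directive-based model described in the last statement can be sketched with a single annotated loop. The example below is a generic scaled vector addition chosen for illustration, not code from the cited CFD solvers. The function name `saxpy` and its parameters are assumptions. The `parallel loop`, `copyin`, and `copy` clauses are standard OpenACC.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    /* OpenACC directive: asks the compiler to offload the loop, copying
       x to the device and y to the device and back. A compiler built
       without OpenACC support ignores the pragma and runs the loop on
       the host, so the source stays valid C either way. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Because the loop body is unchanged and only a directive is added, the same source can be built for the host or for a GPGPU, which is the portability and maintainability argument the citing authors make.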