2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)
DOI: 10.1109/pdcat46702.2019.00033
Accelerating Conjugate Gradient using OmpSs

Cited by 8 publications (3 citation statements)
References 16 publications
“…Listing 2 shows the modifications in the algorithm. These modifications consist of swapping the order of execution of some of the kernels, mainly AXPY operations and DOT products [12], exposing a higher parallelism.…”
Section: Optimized Conjugate Gradient Methods
confidence: 99%
See 1 more Smart Citation
“…Listing 2 shows the modifications in the algorithm. These modifications consist of swapping the order of execution of some of the kernels, mainly AXPY operations and DOT products [12], exposing a higher parallelism.…”
Section: Optimized Conjugate Gradient Methodsmentioning
confidence: 99%
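The excerpt above refers to reordering AXPY and DOT kernels in the Conjugate Gradient method to expose more parallelism. For context, the sketch below shows the kernel structure of a *standard* CG iteration in NumPy, with each SpMV, DOT, and AXPY kernel labeled; it is a minimal dense-matrix illustration, not the cited paper's task-based or reordered version, and the `cg` function name and its parameters are assumptions made here.

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=1000):
    """Standard (unoptimized) Conjugate Gradient; kernels labeled."""
    x = np.zeros_like(b)
    r = b - A @ x                        # SpMV + AXPY
    p = r.copy()
    rs_old = r @ r                       # DOT
    for _ in range(max_iter):
        Ap = A @ p                       # SpMV (one per iteration)
        alpha = rs_old / (p @ Ap)        # DOT
        x = x + alpha * p                # AXPY
        r = r - alpha * Ap               # AXPY
        rs_new = r @ r                   # DOT
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p    # AXPY-like update
        rs_old = rs_new
    return x
```

The DOT products above force synchronization points between the AXPY kernels; optimizations like the one the excerpt describes rearrange these kernels so that independent operations can overlap.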
“…Those HPC applications that are composed of multiple memory-bound kernels which have to perform the operations repeatedly or have an iterative nature, such as those leveraged in this work, but many others as well, such as CFD simulations [4][5][6][7], image processing [8,9], AI kernels [10], or Linear Algebra kernels [11][12][13][14], just to mention a few, can benefit from the use of Static Graphs by reducing the CPU-GPU communication overhead and achieving higher GPU occupancy. To the best of our knowledge, this is the first time that CUDA Graph has been integrated with OpenACC and effectively adapted to the two different algorithms used as test cases in this work: the Conjugate Gradient Method and Particle Swarm Optimization.…”
Section: Introduction
confidence: 99%
“…Indeed, a myriad of numerical simulation applications, commercial and ad hoc solutions, use non-stationary iterative methods because of their high effectiveness and robustness when solving linear systems of equations [8]. The most popular solvers included in this category are: the Conjugate Gradient (CG), which requires one SpMV product per iteration [9]; the Generalized Minimum Residual Method (GMRES), which also uses one SpMV product per iteration [4]; the BiConjugate Gradient (BiCG), which needs two SpMV products per iteration [10]; and the BiConjugate Gradient Stabilised (BiCGS), which also needs two SpMV products per iteration [11]. Optimising the SpMV product on modern multi- and many-core processors for general sparse matrices is not a trivial task because, in order to harness the strong parallel-processing capabilities of these devices, the computations must have regular execution paths and memory access patterns, which are hardly ever present in the sparse matrices generated by real-life numerical applications.…”
Section: Introduction and Related Work
confidence: 99%
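The excerpt above attributes the difficulty of optimizing SpMV to irregular memory access patterns. A minimal CSR (compressed sparse row) SpMV sketch makes this concrete: the gather from `x` through `col_idx` is the indirect, data-dependent access the excerpt refers to. The `spmv_csr` helper is a hypothetical name introduced here for illustration.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for A stored in CSR format."""
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        # Nonzeros of row i live in values[row_ptr[i]:row_ptr[i+1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # col_idx[k] drives an irregular gather from x: the
            # access pattern depends on the sparsity structure.
            y[i] += values[k] * x[col_idx[k]]
    return y
```

Because `col_idx` can point anywhere in `x`, cache behavior and vectorization depend entirely on the matrix's sparsity pattern, which is why a single SpMV implementation rarely performs well across all matrices.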