Given the great importance of dense matrix-vector multiplication (Ax or A^T x) in scientific computing, this paper investigates how to accelerate it on the graphics processing unit (GPU). We present a warp-based GPU implementation of Ax, called GEMV-Adaptive, and a thread-based GPU implementation of A^T x, called GEMV-T-Adaptive. The proposed GEMV-Adaptive and GEMV-T-Adaptive offer the following novelties: (1) an adaptive warp allocation strategy for GEMV-Adaptive that assigns the optimal number of warps to each matrix row, (2) an adaptive thread allocation strategy for GEMV-T-Adaptive that assigns the optimal number of threads to each matrix row, and (3) several optimization schemes. Experimental results show that the proposed GEMV-Adaptive and GEMV-T-Adaptive mitigate the performance fluctuations of the implementations in the CUBLAS library, sustain high performance, and outperform the GEMV and GEMV-T kernels recently proposed by Gao et al, respectively, for all test matrices.

KEYWORDS: CUDA, dense matrix-vector multiplication, GPU
INTRODUCTION

The dense matrix-vector multiplication routine performs one of

y = Ax or y = A^T x,

where A ∈ R^{m×n} is a dense matrix, and x and y are vectors. It has proven to be of particular importance in computational science and has been successfully applied in various fields.1-7 As matrix sizes increase in practical problems, parallel computing is required to efficiently improve the performance of dense matrix-vector multiplication.

Processing big data with graphics processing units (GPUs) has drawn much attention in recent years. Following NVIDIA's introduction of the compute unified device architecture (CUDA),8 a programming model that supports joint CPU/GPU execution of applications, in 2007, GPUs have become strong competitors as general-purpose parallel programming systems.

Researchers have recently developed suitable and flexible dense matrix-vector multiplication algorithms for the GPU architecture.9-14 A representative example is KBLAS by Abdelfattah et al.14 KBLAS is an optimized library of dense matrix-vector multiplication kernels for GPUs; it runs efficiently on various GPU architectures while avoiding code rewriting and retaining compliance with the standard BLAS API. Experimental results show that KBLAS either matches or outperforms existing state-of-the-art open-source and commercial implementations (eg, NVIDIA's standard BLAS implementation CUBLAS,15 MAGMABLAS,16 and CULA17) on different matrix sizes. A subset of the high-performance KBLAS kernels has been integrated into CUBLAS, starting with version 6.0, for wider dissemination.18
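To make the warp-based idea concrete, the sketch below shows the generic one-warp-per-row GEMV pattern on which such kernels typically build. It is a minimal illustration only, assuming a row-major single-precision matrix and CUDA 9+ shuffle intrinsics; the kernel name and launch parameters are ours, and it is not the authors' GEMV-Adaptive kernel, which adaptively varies the number of warps assigned to each row rather than fixing one warp per row.

```cuda
// Illustrative sketch: one warp computes one row of y = A x.
// Assumptions (not from the paper): row-major float matrix A,
// blockDim.x is a multiple of 32, CUDA 9+ for __shfl_down_sync.
#include <cuda_runtime.h>

#define WARP_SIZE 32

__global__ void gemv_warp_per_row(int m, int n,
                                  const float *A,   // m x n, row-major
                                  const float *x,   // length n
                                  float *y)         // length m
{
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;
    int lane   = threadIdx.x % WARP_SIZE;
    if (warpId >= m) return;   // whole warps retire together here

    // Each lane accumulates a strided partial dot product of row warpId with x.
    float sum = 0.0f;
    for (int j = lane; j < n; j += WARP_SIZE)
        sum += A[warpId * n + j] * x[j];

    // Warp-level reduction with shuffle intrinsics (no shared memory needed).
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0)
        y[warpId] = sum;       // lane 0 writes the row result
}
```

A launch such as `gemv_warp_per_row<<<(m * 32 + 255) / 256, 256>>>(m, n, dA, dx, dy)` gives each of the m rows exactly one warp. The limitation of this fixed assignment, which the adaptive strategies in this paper address, is that one warp per row underutilizes the GPU for wide rows and wastes lanes for narrow ones.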