Simulations of many multi-component PDE-based applications, such as petroleum reservoirs or reacting flows, are dominated by the solution, on each time step and within each Newton step, of large sparse linear systems. The standard solver is a preconditioned Krylov method. Along with application of the preconditioner, memory-bound Sparse Matrix-Vector Multiplication (SpMV) is the most time-consuming operation in such solvers. Multi-species models produce Jacobians with a dense block structure, where the block size can be as large as a few dozen. Failing to exploit this dense block structure vastly underutilizes hardware capable of delivering high performance on dense BLAS operations. This paper presents a GPU-accelerated SpMV kernel for block-sparse matrices. The dense matrix-vector multiplications within the sparse-block structure leverage optimization techniques from the KBLAS library, a high-performance library for dense BLAS kernels, whose design ideas carry over to block-sparse matrices. Furthermore, a technique is proposed to balance the workload among thread blocks when there are large variations in the lengths of the nonzero rows. Multi-GPU performance is highlighted. The proposed SpMV kernel outperforms existing state-of-the-art implementations on matrices with real structures from different applications.

Because of its importance and wide use, SpMV has a rich literature of proposed implementations for several formats, including Compressed Sparse Row (CSR), ELLPACK [5], and the Coordinate (COO) format. The authors of [5] also proposed HYB, a hybrid format that combines the ELLPACK and COO formats in an effort to reduce the padding overhead of ELLPACK. Most of these implementations (likely further optimized) are available in the cuSPARSE library [6], which provides the baselines against which most researchers compare their techniques. The four formats (CSR, ELLPACK, COO, and HYB) are shown in Figure 1.

Figure 1. Representation of a block-sparse matrix by different formats.

The ELLPACK format [5] is perhaps the most convenient format for GPUs, because a sparse matrix A is stored as a dense matrix (in column-major order) with dimensions m × nnz_max, where m is the number of rows of A and nnz_max is the maximum number of non-zeros found in the rows of A (Figure 1(b)). Another dense matrix is required to store the integer column indices of the non-zeros. The regularity of the ELLPACK format is obtained at the cost of zero-padding overhead whenever the row lengths of A vary; the overhead is reflected in extra memory reads plus extra computation (a minimal SpMV kernel for this format is sketched at the end of this section).

Many researchers have proposed remedies to the ELLPACK overhead. Monakov et al. [7] proposed a sliced version of the ELLPACK format, where each slice is stored in a separate ELLPACK structure. The slice size can be fixed or variable, and the zero padding can be further reduced by reordering the rows according to their lengths. Vázquez et al. [8] proposed the ELLPACK-R format that adds auxiliary information to avoid th...
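For concreteness, the following is a minimal sketch of an ELLPACK SpMV kernel of the kind described above, with one thread assigned per matrix row. It is an illustration only, not the kernel proposed in this paper; the kernel name, the double-precision data type, and the convention that padded entries carry a zero value and a valid column index are assumptions made for the example. The column-major layout is what keeps the loads coalesced across consecutive threads, and the loop up to nnz_max is where the padding overhead appears as wasted reads and multiplications.

    // Sketch of an ELLPACK SpMV kernel (y = A*x), one thread per row.
    // values/colind are m x nnz_max arrays stored column-major; padded slots
    // hold a zero value and a valid (e.g., repeated) column index.
    __global__ void ellpack_spmv(int m, int nnz_max,
                                 const double *values,
                                 const int    *colind,
                                 const double *x,
                                 double       *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= m) return;

        double sum = 0.0;
        for (int j = 0; j < nnz_max; j++) {
            // Column-major storage: entry (row, j) sits at j*m + row, so the
            // threads of a warp read consecutive addresses (coalesced loads).
            double a = values[j * m + row];
            int    c = colind[j * m + row];
            sum += a * x[c];   // padded zeros still cost a read and a multiply
        }
        y[row] = sum;
    }

    // Example launch (hypothetical sizes):
    // ellpack_spmv<<<(m + 255) / 256, 256>>>(m, nnz_max, d_values, d_colind, d_x, d_y);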
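To illustrate the block-sparse setting targeted by the paper, the sketch below shows an unoptimized BSR-style (block compressed sparse row) SpMV in which every nonzero entry is a dense BS x BS block and each block row is handled by one CUDA thread block. This is not the KBLAS-based kernel of the paper; the array names (browptr, bcolind, bvalues), the block size, and the row-major storage of each dense block are assumptions for illustration. The inner dense block-times-vector product is the part that dense-BLAS-style optimizations would accelerate.

    #define BS 8   // example block size; multi-species Jacobians can have blocks of a few dozen

    // Sketch of a BSR-style block-sparse SpMV (y = A*x), one CUDA block per block row.
    __global__ void bsr_spmv(int mb,                    // number of block rows
                             const int    *browptr,     // block-row pointers (size mb+1)
                             const int    *bcolind,     // block column indices
                             const double *bvalues,     // dense BS x BS blocks, row-major
                             const double *x,
                             double       *y)
    {
        int brow = blockIdx.x;      // block row handled by this CUDA block
        int r    = threadIdx.x;     // local row inside the dense block
        if (brow >= mb || r >= BS) return;

        double sum = 0.0;
        for (int b = browptr[brow]; b < browptr[brow + 1]; b++) {
            const double *blk = bvalues + (size_t)b * BS * BS;   // b-th dense block
            const double *xb  = x + bcolind[b] * BS;             // matching segment of x
            for (int c = 0; c < BS; c++)                         // dense block * vector
                sum += blk[(size_t)r * BS + c] * xb[c];
        }
        y[brow * BS + r] = sum;
    }

    // Example launch (hypothetical): bsr_spmv<<<mb, BS>>>(mb, d_browptr, d_bcolind, d_bvalues, d_x, d_y);

The number of blocks per block row, browptr[brow + 1] - browptr[brow], varies from one block row to another; this is exactly the row-length variation that motivates the workload-balancing technique mentioned in the abstract.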