Load-balancing Sparse Matrix Vector Product Kernels on GPUs

Anzt, Hartwig; Cojean, Terry; Chen, Yen‐Chen; Dongarra, Jack; Flegar, Goran; Nayak, Pratik; Tomov, Stanimire; Tsai, Yu‐Hsiang; Wang, Weichung

doi:10.1145/3380930

Cited by 35 publications

(33 citation statements)

References 15 publications

“…Given the different hardware characteristics, see Table 1, we optimize kernel parameters like group size for the distinct architectures. More relevant, for the CSR, ELL, and HYB kernels, we modify the SpMV execution strategy for the AMD architecture from the strategy that was previously realized for NVIDIA architectures [2].…”

Section: Sparse Matrix Vector Kernel Designs (mentioning)

confidence: 99%

“…In Algorithm 2, we assign a "subwarp" (multiple threads) to each row, and use warp reduction mechanisms to accumulate the partial results before writing to the output vector. This classical CSR assigning multiple threads to each row is inspired by the performance improvement of the ELL SpMV in [2]. We adjust the number of threads assigned to each row to the maximum number of nonzeros in a row.…”

Section: CSR SpMV Kernel (mentioning)

confidence: 99%
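
The subwarp-per-row idea quoted above can be illustrated with a small CPU emulation. This is only a sketch under stated assumptions, not the kernel from either paper: the function names `pick_subwarp_size` and `csr_spmv_subwarp` are hypothetical, the rule "smallest power of two covering the longest row, capped at the warp size" is an assumed reading of "adjust the number of threads ... to the maximum number of nonzeros in a row", and the tree reduction stands in for a warp shuffle reduction.

```python
def pick_subwarp_size(row_ptrs, warp_size=32):
    """Hypothetical heuristic: smallest power of two >= longest row, capped."""
    max_nnz = max(row_ptrs[i + 1] - row_ptrs[i] for i in range(len(row_ptrs) - 1))
    size = 1
    while size < max_nnz and size < warp_size:
        size *= 2
    return size


def csr_spmv_subwarp(row_ptrs, col_idxs, values, x, subwarp_size):
    """Emulate y = A @ x with `subwarp_size` lanes cooperating on each row.

    `subwarp_size` is assumed to be a power of two.
    """
    y = [0.0] * (len(row_ptrs) - 1)
    for row in range(len(y)):
        start, end = row_ptrs[row], row_ptrs[row + 1]
        # Each lane strides through the nonzeros of the row.
        partial = [0.0] * subwarp_size
        for lane in range(subwarp_size):
            for k in range(start + lane, end, subwarp_size):
                partial[lane] += values[k] * x[col_idxs[k]]
        # Tree reduction over the lanes (stand-in for a warp reduction).
        offset = subwarp_size // 2
        while offset > 0:
            for lane in range(offset):
                partial[lane] += partial[lane + offset]
            offset //= 2
        y[row] = partial[0]  # only lane 0 writes the result
    return y


# 3x3 example matrix [[1, 0, 2], [0, 3, 0], [4, 5, 6]] in CSR form.
row_ptrs = [0, 2, 3, 6]
col_idxs = [0, 2, 1, 0, 1, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.0, 2.0, 3.0]
y = csr_spmv_subwarp(row_ptrs, col_idxs, values, x, pick_subwarp_size(row_ptrs))
```

On a GPU the loop over lanes runs in parallel, so short rows leave some lanes idle while long rows finish in fewer steps; the subwarp size trades these two effects against each other.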

“…In [2], the authors demonstrated that the ELL SpMV kernel can be accelerated by assigning multiple threads to each row, and using an "early stopping" strategy to terminate thread blocks early if they reach the padding part of the ELL format. Porting this strategy to AMD architectures, we discovered that the non-coalesced global memory access possible when assigning multiple threads to the rows of the ELL matrix stored in column-major format can result in low performance.…”

Section: ELL SpMV Kernel (mentioning)

confidence: 99%
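
As a rough illustration of the ELL layout and the "early stopping" idea in the statement above, the following sequential sketch stores the matrix column-major with padding and stops working on a row once padding is reached. The padding marker `PAD = -1` and the function name are assumptions made for this example, and stopping per row is a simplification of the quoted strategy, which terminates whole thread blocks.

```python
PAD = -1  # assumed marker for padded column indices


def ell_spmv_early_stop(values, col_idxs, num_rows, stride, slots, x):
    """Reference y = A @ x for a column-major ELL matrix.

    Entry (row, slot) lives at index slot * stride + row.
    """
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        for slot in range(slots):
            idx = slot * stride + row
            col = col_idxs[idx]
            if col == PAD:
                break  # the rest of this row is padding
            acc += values[idx] * x[col]
        y[row] = acc
    return y


# Matrix [[1, 0, 2], [0, 3, 0], [4, 5, 6]] with 3 slots per row, stride 3.
values = [1.0, 3.0, 4.0,   # slot 0 of rows 0, 1, 2
          2.0, 0.0, 5.0,   # slot 1 (row 1 is padded)
          0.0, 0.0, 6.0]   # slot 2 (rows 0 and 1 are padded)
col_idxs = [0, 1, 0,
            2, PAD, 1,
            PAD, PAD, 2]
y = ell_spmv_early_stop(values, col_idxs, num_rows=3, stride=3, slots=3,
                        x=[1.0, 2.0, 3.0])
```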

“…Porting this strategy to AMD architectures, we discovered that the non-coalesced global memory access possible when assigning multiple threads to the rows of the ELL matrix stored in column-major format can result in low performance. The reason behind this is that the strategy in [2] uses threads of the same group to handle one row, which results in adjacent threads always reading matrix elements that are m (matrix size or stride) memory locations apart. To overcome this problem, we rearrange the memory access by assigning the threads of the same group to handle one column like the classical ELL kernel, but assigning several groups to each row to increase GPU usage.…”

Section: ELL SpMV Kernel (mentioning)

confidence: 99%
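
The coalescing argument in this statement comes down to index arithmetic, which the short sketch below makes explicit. It assumes the usual column-major ELL addressing `slot * stride + row`; the function names are hypothetical and only list the addresses that the lanes of one group would read.

```python
def ell_index(row, slot, stride):
    """Linear index of entry (row, slot) in column-major ELL storage."""
    return slot * stride + row


def addresses_group_per_row(row, group_size, stride):
    """Lanes of one group walk along a single row (one slot per lane)."""
    return [ell_index(row, lane, stride) for lane in range(group_size)]


def addresses_group_per_column(first_row, slot, group_size, stride):
    """Lanes of one group read the same slot of adjacent rows."""
    return [ell_index(first_row + lane, slot, stride) for lane in range(group_size)]


strided = addresses_group_per_row(row=0, group_size=4, stride=8)
contiguous = addresses_group_per_column(first_row=0, slot=1, group_size=4, stride=8)
# strided    -> [0, 8, 16, 24]   (adjacent lanes are `stride` apart)
# contiguous -> [8, 9, 10, 11]   (adjacent lanes read adjacent locations)
```

With one group per row, neighbouring lanes touch locations a full stride apart, which is the non-coalesced pattern described in the statement; mapping the lanes of a group to adjacent rows of the same slot yields consecutive addresses instead.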

“…Given the long list of efforts covering the design and evaluation of SpMV kernels on manycore processors, see [2,7] for a recent and comprehensive overview of SpMV research, we highlight that this work contains the following novel contributions:…”

Section: Introduction (mentioning)

confidence: 99%

Sparse Linear Algebra on AMD and NVIDIA GPUs – The Race Is On

Tsai, Cojean, Anzt

2020

Lecture Notes in Computer Science

Self Cite


Efficiently processing sparse matrices is a central and performance-critical part of many scientific simulation codes. Recognizing the adoption of manycore accelerators in HPC, we evaluate in this paper the performance of the currently best sparse matrix-vector product (SpMV) implementations on high-end GPUs from AMD and NVIDIA. Specifically, we optimize SpMV kernels for the CSR, COO, ELL, and HYB format taking the hardware characteristics of the latest GPU technologies into account. We compare for 2,800 test matrices the performance of our kernels against AMD's hipSPARSE library and NVIDIA's cuSPARSE library, and ultimately assess how the GPU technologies from AMD and NVIDIA compare in terms of SpMV performance.

Using Ginkgo's memory accessor for improving the accuracy of memory‐bound low precision BLAS

2021

Self Cite


The roofline model not only provides a powerful tool to relate an application's performance with the specific constraints imposed by the target hardware but also offers a graphic representation of the balance between memory access cost and compute throughput. In this work, we present a strategy to break up the tight coupling between the precision format used for arithmetic operations and the storage format employed for memory operations. (At a high level, this idea is equivalent to compressing/decompressing the data in registers before/after invoking store/load memory operations.) In practice, we demonstrate that a “memory accessor” that hides the data compression behind the memory access, can virtually push the bandwidth‐induced roofline, yielding higher performance for memory‐bound applications using high precision arithmetic that can handle the numerical effects associated with lossy compression. We also demonstrate that memory‐bound applications operating on low precision data can increase the accuracy by relying on the memory accessor to perform all arithmetic operations in high precision. In particular, we demonstrate that memory‐bound BLAS operations (including the sparse matrix‐vector product) can be re‐engineered with the memory accessor and that the resulting accessor‐enabled BLAS routines achieve lower rounding errors while delivering the same performance as the fast low precision BLAS.
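
The decoupling of storage precision from arithmetic precision described in this abstract can be sketched in a few lines. This is not Ginkgo's accessor API: the class `Float32Accessor` and both dot-product helpers are hypothetical, float32 storage with float64 arithmetic is an assumed pairing, and the low-precision variant is emulated by rounding every intermediate result to float32.

```python
import struct
from array import array


def round_to_f32(v):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", v))[0]


class Float32Accessor:
    """Stores data as float32 but hands out doubles for arithmetic."""

    def __init__(self, data):
        self._storage = array("f", data)  # 4 bytes per entry in memory

    def __len__(self):
        return len(self._storage)

    def load(self, i):
        return self._storage[i]  # widened to double on load

    def store(self, i, v):
        self._storage[i] = v  # rounded to float32 on store


def dot_high_precision(a, b):
    """Accumulate in double; only the stored operands are float32."""
    acc = 0.0
    for i in range(len(a)):
        acc += a.load(i) * b.load(i)
    return acc


def dot_low_precision(a, b):
    """Emulated float32 arithmetic: round after every operation."""
    acc = 0.0
    for i in range(len(a)):
        acc = round_to_f32(acc + round_to_f32(a.load(i) * b.load(i)))
    return acc


a = Float32Accessor([1.0] + [2.0 ** -25] * 4)
b = Float32Accessor([1.0] * 5)
high = dot_high_precision(a, b)  # small terms survive the accumulation
low = dot_low_precision(a, b)    # small terms are rounded away
```

Both variants read the same 4-byte values from memory, so a memory-bound kernel moves the same amount of data either way; only the accumulation differs.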


pSpMv: precision-based sparse matrix partition and SpMV optimization

Liu, Wang, Gao et al.

2024

CCF Trans. HPC


The new generation of computing devices tends to support multiple floating-point formats and different computing precision. Besides single and double precision, half precision is embraced and widely supported by new computing devices. Low-precision representations have compact memory size and lightweight computing strength, and they also bring opportunities to the optimization of BLAS routines. This paper proposes a new sparse matrix partition approach based on IEEE 754 standard floating-point format. An input sparse matrix in double precision is partitioned and transformed into several sub-matrices in different precision without loss of accuracy. Most non-zero elements can be stored in half or single precision, if the most significant bits of exponent and the least significant bits of mantissa are zeros in double-precision representation. Based on this mixed-precision representation of sparse matrix, we also present a new SpMV algorithm pSpMV for GPU devices. pSpMV not only reduces the memory access overhead, but also reduces the computing strength of floating-point numbers. Experimental results on two GPU devices show that pSpMV achieves a geometric mean speedup of 1.39x on Tesla V100 and 1.45x on Tesla P100 over double-precision SpMV for 2,554 sparse matrices.
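
A lossless precision-based partition of the kind this abstract describes can be sketched as follows. Note the swapped-in technique: the abstract inspects exponent and mantissa bits directly, whereas this sketch uses a round-trip test (convert to the narrower format and back, then compare) as a stand-in criterion for exact representability. The function names and the COO triplet input are assumptions for illustration, and the arithmetic here runs in double throughout rather than in mixed precision.

```python
import struct


def fits(fmt, v):
    """True if `v` survives a round trip through struct format `fmt`."""
    try:
        return struct.unpack(fmt, struct.pack(fmt, v))[0] == v
    except OverflowError:  # magnitude too large for the narrow format
        return False


def classify(v):
    """Narrowest IEEE 754 format that stores `v` exactly."""
    if fits("e", v):  # binary16
        return "half"
    if fits("f", v):  # binary32
        return "single"
    return "double"


def partition(entries):
    """Split COO triplets (row, col, value) into per-precision lists."""
    parts = {"half": [], "single": [], "double": []}
    for row, col, val in entries:
        parts[classify(val)].append((row, col, val))
    return parts


def spmv_partitioned(parts, x, num_rows):
    """y = A @ x as the sum of the contributions of all sub-matrices."""
    y = [0.0] * num_rows
    for sub in parts.values():
        for row, col, val in sub:
            y[row] += val * x[col]
    return y


entries = [(0, 0, 1.5), (0, 1, 1.0 + 2.0 ** -20), (1, 1, 0.1), (1, 0, 70000.0)]
parts = partition(entries)
y = spmv_partitioned(parts, [1.0, 0.0], num_rows=2)
```

Because every value is stored in a format that represents it exactly, the partition itself loses no accuracy; any memory savings come from how many nonzeros land in the half and single buckets.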
