Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing 2018
DOI: 10.1145/3208040.3208062
Efficient sparse-matrix multi-vector product on GPUs

Cited by 59 publications (37 citation statements)
References 25 publications
“…3b. The dense blocks are stored as dense matrices in row-major order with empty cells filled with zeros, and the sparse block is stored in Compressed Sparse Row (CSR) format [22]. We took inspiration from the data reorganization idea proposed in [25] and developed a new algorithm for extracting dense blocks from a sparse matrix.…”
Section: Overview of OptPrune
confidence: 99%
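For readers unfamiliar with the CSR layout the quote refers to, the sketch below shows its three arrays (row pointers, column indices, values) and a basic SpMM kernel over dense row-major blocks. It is a minimal illustration, not the cited paper's implementation; the kernel name `csr_spmm` and the one-thread-per-output-element mapping are assumptions made for clarity.

```cuda
// Minimal CSR SpMM sketch (illustrative; not the cited paper's code).
#include <cstdio>
#include <cuda_runtime.h>

// One thread per (row, output column) of C = A * B, where A is an
// n_rows x n_cols CSR matrix and B, C are dense row-major with k columns.
__global__ void csr_spmm(int n_rows, const int *row_ptr, const int *col_idx,
                         const float *vals, const float *B, float *C, int k) {
    int row = blockIdx.x;
    int j = blockIdx.y * blockDim.x + threadIdx.x;
    if (row >= n_rows || j >= k) return;
    float acc = 0.0f;
    for (int p = row_ptr[row]; p < row_ptr[row + 1]; ++p)
        acc += vals[p] * B[col_idx[p] * k + j];   // gather one sparse row
    C[row * k + j] = acc;
}

int main() {
    // 2 x 3 sparse A = [[1,0,2],[0,3,0]] in CSR; B is 3 x 2 dense row-major.
    int h_row_ptr[] = {0, 2, 3};
    int h_col_idx[] = {0, 2, 1};
    float h_vals[]  = {1, 2, 3};
    float h_B[] = {1, 2, 3, 4, 5, 6};
    int *d_rp, *d_ci; float *d_v, *d_B, *d_C;
    cudaMalloc(&d_rp, sizeof h_row_ptr); cudaMalloc(&d_ci, sizeof h_col_idx);
    cudaMalloc(&d_v, sizeof h_vals); cudaMalloc(&d_B, sizeof h_B);
    cudaMalloc(&d_C, 4 * sizeof(float));
    cudaMemcpy(d_rp, h_row_ptr, sizeof h_row_ptr, cudaMemcpyHostToDevice);
    cudaMemcpy(d_ci, h_col_idx, sizeof h_col_idx, cudaMemcpyHostToDevice);
    cudaMemcpy(d_v, h_vals, sizeof h_vals, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, sizeof h_B, cudaMemcpyHostToDevice);
    csr_spmm<<<dim3(2, 1), 32>>>(2, d_rp, d_ci, d_v, d_B, d_C, 2);
    float h_C[4];
    cudaMemcpy(h_C, d_C, sizeof h_C, cudaMemcpyDeviceToHost);
    printf("C = [[%g, %g], [%g, %g]]\n", h_C[0], h_C[1], h_C[2], h_C[3]);
    return 0;
}
```

High-performance SpMM kernels tile the dense operand and reuse it through shared memory; this sketch omits those optimizations to keep the data layout visible.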
“…As explained in the background section, sparse convolutions can be implemented as SpMM. Although previous works have studied SpMM on GPUs [22,23,25], their optimization techniques mainly target large sparse matrices, with at least 10,000 rows and columns, found in scientific computing applications, and they cannot deliver good performance for sparse convolutions, where the number of convolution kernels is usually smaller than 1,000. In fact, we adopted a state-of-the-art implementation of SpMM from [25] for sparse convolution with real-world pruned models from [46], and we found that the sparse convolutions do not run much faster (and can even be slower) than the original dense convolutions implemented as GEMM.…”
Section: Implementing Sparse Convolutions with GEMM
confidence: 99%
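The conv-to-SpMM mapping this quote relies on is the standard im2col lowering: with K filters, C input channels, and R x S kernels, the pruned weight matrix has shape K x (C*R*S) and is multiplied by an im2col'd input of shape (C*R*S) x (number of output positions), so K, often under 1,000, becomes the row count of the sparse matrix. Below is a minimal host-side sketch of im2col under those assumptions (single image, stride 1, no padding); the function name and layout are illustrative, not taken from the cited work.

```cuda
// Host-side im2col sketch showing why a pruned convolution becomes an
// SpMM with only K rows (K = number of filters).
#include <cstdio>
#include <vector>

// input x: C x H x W (row-major), kernel R x S, stride 1, no padding.
// Returns cols: (C*R*S) x (P*Q) with P = H-R+1, Q = W-S+1.
static std::vector<float> im2col(const std::vector<float>& x,
                                 int C, int H, int W, int R, int S) {
    int P = H - R + 1, Q = W - S + 1;
    std::vector<float> cols((size_t)C * R * S * P * Q);
    for (int c = 0; c < C; ++c)
      for (int r = 0; r < R; ++r)
        for (int s = 0; s < S; ++s)
          for (int p = 0; p < P; ++p)
            for (int q = 0; q < Q; ++q)
              cols[(((size_t)(c * R + r) * S + s) * P + p) * Q + q] =
                  x[((size_t)c * H + p + r) * W + q + s];
    return cols;
}

int main() {
    // C=1, H=W=3 input; R=S=2 kernel -> cols is 4 x 4.
    std::vector<float> x = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    auto cols = im2col(x, 1, 3, 3, 2, 2);
    for (size_t i = 0; i < cols.size(); ++i)          // 4 output positions
        printf("%g%c", cols[i], (i % 4 == 3) ? '\n' : ' ');
    return 0;
}
```

Multiplying a K x 4 pruned weight matrix against these columns is exactly the small-row-count SpMM the quote says existing GPU kernels handle poorly.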
“…Hong et al. proposed a new sparse-matrix format and SpMM algorithm named Row-Segmented-SpMM (RS-SpMM) [9]. The sparse matrix is divided into two groups.…”
Section: Related Work: SpMM for GPU
confidence: 99%
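The quote is truncated before it describes the two groups. As a hedged illustration only, the sketch below splits the rows of a CSR matrix into a "heavy" and a "light" group using a per-row nonzero threshold; the actual RS-SpMM partitioning in [9] operates on row segments rather than whole rows, so this approximates the idea rather than reproducing the paper's algorithm.

```cuda
// Host-side sketch of a two-way matrix split (assumption: per-row nnz
// threshold; RS-SpMM itself partitions by row segments, see [9]).
#include <cstdio>
#include <vector>

// Rows with at least `threshold` nonzeros go to `heavy` (amenable to
// dense-like processing); the rest stay in `light` (kept in CSR).
static void split_rows(const std::vector<int>& row_ptr, int threshold,
                       std::vector<int>& heavy, std::vector<int>& light) {
    for (int r = 0; r + 1 < (int)row_ptr.size(); ++r) {
        int nnz = row_ptr[r + 1] - row_ptr[r];
        (nnz >= threshold ? heavy : light).push_back(r);
    }
}

int main() {
    // row_ptr for a 4-row matrix with 6, 1, 5, and 0 nonzeros per row.
    std::vector<int> row_ptr = {0, 6, 7, 12, 12};
    std::vector<int> heavy, light;
    split_rows(row_ptr, /*threshold=*/4, heavy, light);
    printf("heavy rows: %zu, light rows: %zu\n", heavy.size(), light.size());
    return 0;
}
```

Each group can then be handed to a kernel specialized for its density, which is the motivation behind the two-group design.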
“…These sparse formats, which are suitable for cache-aware CPU platforms, usually provide poor performance on GPUs. For this reason, new formats have been developed to allow an efficient implementation of a sparse matrix-vector product on GPUs [17,68]. A sparse format included in cuSPARSE that is specifically designed for use on GPUs is HYB.…”
Section: CUDA Software
confidence: 99%
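cuSPARSE's HYB format combines an ELL part (a fixed number of nonzeros per row, padded for regular access) with a COO part holding the overflow. The sketch below builds such a two-part structure from CSR on the host; the struct layout, field names, and fixed-width policy are assumptions made for illustration and do not reproduce cuSPARSE's opaque internal representation.

```cuda
// Host-side sketch of the HYB idea: ELL for the regular part, COO for
// the irregular overflow (layout is illustrative, not cuSPARSE's).
#include <cstdio>
#include <vector>

struct Hyb {
    int n_rows, ell_width;
    std::vector<int>   ell_col;              // n_rows * ell_width, -1 = pad
    std::vector<float> ell_val;              // n_rows * ell_width, 0 = pad
    std::vector<int>   coo_row, coo_col;     // overflow entries
    std::vector<float> coo_val;
};

// The first `width` nonzeros of each CSR row go to ELL; the rest spill
// into COO, keeping the ELL part perfectly rectangular.
static Hyb csr_to_hyb(const std::vector<int>& row_ptr,
                      const std::vector<int>& col_idx,
                      const std::vector<float>& vals, int width) {
    int n = (int)row_ptr.size() - 1;
    Hyb h{n, width,
          std::vector<int>((size_t)n * width, -1),
          std::vector<float>((size_t)n * width, 0.0f), {}, {}, {}};
    for (int r = 0; r < n; ++r)
        for (int p = row_ptr[r]; p < row_ptr[r + 1]; ++p) {
            int k = p - row_ptr[r];
            if (k < width) {
                h.ell_col[(size_t)r * width + k] = col_idx[p];
                h.ell_val[(size_t)r * width + k] = vals[p];
            } else {
                h.coo_row.push_back(r);
                h.coo_col.push_back(col_idx[p]);
                h.coo_val.push_back(vals[p]);
            }
        }
    return h;
}

int main() {
    std::vector<int> row_ptr = {0, 2, 3, 6};
    std::vector<int> col_idx = {0, 2, 1, 0, 1, 2};
    std::vector<float> vals  = {1, 2, 3, 4, 5, 6};
    Hyb h = csr_to_hyb(row_ptr, col_idx, vals, /*width=*/2);
    printf("ELL slots: %zu, COO overflow: %zu\n",
           h.ell_val.size(), h.coo_val.size());
    return 0;
}
```

The padded ELL part gives GPU threads uniform, coalesced work per row, while the small COO tail absorbs the irregular rows that would otherwise force excessive padding.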