Efficient CSR-Based Sparse Matrix-Vector Multiplication on GPU

Gao, Jiaquan; Qi, Panpan; He, Gaohong

doi:10.1155/2016/4596943

Cited by 4 publications

(3 citation statements)

References 29 publications

“…Based on the CSR storage format, Bell and Garland proposed two classic parallel algorithms: CSR-Scalar [6] and CSR-Vector [6]. Lu et al [7] suggested filling the CSR array to optimize CSR-Scalar, achieving a 30% improvement in memory access performance; Dehnavi et al [8] put forward a prefetching CSR method that divides the non-zero elements of the matrix into blocks of the same size and allocates them to a GPU-like accelerator for computation; Greathouse and Daga [9] came up with the CSR-Adaptive algorithm, which dynamically selects between the CSR-Stream algorithm and the CSR-Vector algorithm according to the number of non-zero elements in each row, and uses effective reduction techniques for performance improvement; Gao et al [10] presented a PCSR algorithm to enhance the performance of the kernel by fully merging memory access to the CSR array, and made optimization on the basis of the algorithm, thus proposing the IPCSR algorithm, which reduces two kernels in PCSR to one while maintaining merged accesses to CSR arrays, saving the cost of loading global memory.…”

Section: SpMV Algorithm Based on CSR (mentioning)

confidence: 99%

Transplantation and optimization of SpMV algorithm based on DCU accelerator

Yue et al. 2022

International Conference on Mechanisms and Robotics (ICMAR 2022)


To take full advantage of the DCU accelerator and address the limited bandwidth, unbalanced load, and non-combined memory access of SpMV (sparse matrix-vector multiplication), an SCSR (Static Compressed Sparse Row) algorithm using the CSR (Compressed Sparse Row) storage format is proposed for the DCU accelerator. The algorithm statically allocates the same number of rows to each thread block according to the average number of non-zero elements in each row to avoid unnecessary computations; the demand for storage resources is reduced by reusing the on-chip high-speed storage space LDS (Local Data Share), thus improving CU (Compute Unit) occupancy. The experiment uses 15 sparse matrices from different fields for testing. The results show that, compared with the SpMV algorithm in the hipSPARSE library, the SCSR algorithm achieves an average speedup of 4.83 times.
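For readers unfamiliar with the CSR layout that the cited algorithms share, the following is a minimal sequential sketch in Python of the one-row-per-worker mapping usually associated with CSR-Scalar. It is an illustration only, not code from any of the cited papers; the function name `csr_spmv_scalar` and the array names `row_ptr`, `col_idx`, and `vals` are assumptions chosen for this example.

```python
def csr_spmv_scalar(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i,
    col_idx[k] is the column of the k-th nonzero, vals[k] its value.
    (All names are hypothetical, chosen for illustration.)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    # Each iteration of the outer loop stands in for one GPU thread
    # that owns a whole row (the CSR-Scalar style of work assignment).
    for row in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Example matrix:
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]
y = csr_spmv_scalar(row_ptr, col_idx, vals, x)
```

Because each worker walks its own row, neighbouring workers read entries of `vals` that are far apart when rows are long, which is the uncoalesced access pattern that the variants listed in the citation statement try to avoid.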



“…Perfect-CSR. PCSR (Gao et al, 2016) consists of two main stages. The first stage launches as many blocks as the number of nonzero entries divided by the block dimensions (i.e., one thread per nonzero entry).…”

Section: CSR-Adaptive (mentioning)

confidence: 99%
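The "one thread per nonzero entry" first stage described in the statement above can be pictured with a short sequential Python sketch. The quoted text is cut off before the second stage, so the row-wise reduction used here is an assumption, and the sketch should not be read as the PCSR implementation; the function names `per_nonzero_products`, `reduce_rows`, and `two_stage_spmv` are hypothetical.

```python
def per_nonzero_products(col_idx, vals, x):
    """Stage 1: one independent product per nonzero entry.

    On a GPU each list element would be computed by its own thread,
    so consecutive threads read consecutive entries of vals/col_idx.
    """
    return [v * x[c] for v, c in zip(vals, col_idx)]


def reduce_rows(row_ptr, products):
    """Stage 2 (assumed): sum the products belonging to each row."""
    n_rows = len(row_ptr) - 1
    return [sum(products[row_ptr[r]:row_ptr[r + 1]], 0.0)
            for r in range(n_rows)]


def two_stage_spmv(row_ptr, col_idx, vals, x):
    """y = A @ x computed as per-nonzero products, then row sums."""
    return reduce_rows(row_ptr, per_nonzero_products(col_idx, vals, x))


# Example matrix with an empty middle row:
# [[1, 2, 0],
#  [0, 0, 0],
#  [0, 3, 4]]
row_ptr = [0, 2, 2, 4]
col_idx = [0, 1, 1, 2]
vals = [1.0, 2.0, 3.0, 4.0]
x = [1.0, 1.0, 1.0]
products = per_nonzero_products(col_idx, vals, x)
y2 = two_stage_spmv(row_ptr, col_idx, vals, x)
```

Splitting the work this way makes the first stage perfectly balanced regardless of row lengths, at the cost of storing one intermediate product per nonzero before the reduction.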

“…The first step in the empirical evaluation of our proposal is the execution of all the selected variants of SpMV and a preliminary assessment of results. Our initial experiment compares the runtimes of the cuSparse CsrMV routine with the row-based and merge-path algorithms (which we refer to as cuS_RB and cuS_merge, respectively); the CUSP implementation of CSR-Vector (cusp_vect); an implementation of Liu and Vinter (2015) (bhSparse) published by the authors in their GitHub repository; and our own implementations of Liu and Schmidt (2018) (light), Gao et al (2016) (pcsr), and Greathouse and Daga (2014) (adaptive), based on codes found online.…”

Section: Experimental Evaluation (mentioning)

confidence: 99%

Selecting optimal SpMV realizations for GPUs via machine learning

Dufrechou, Ezzatti, Quintana-Ortí 2021

The International Journal of High Performance Computing Applica


More than 10 years of research related to the development of efficient GPU routines for the sparse matrix-vector product (SpMV) have led to several realizations, each with its own strengths and weaknesses. In this work, we review some of the most relevant efforts on the subject, evaluate a few prominent routines that are publicly available using more than 3000 matrices from different applications, and apply machine learning techniques to anticipate which SpMV realization will perform best for each sparse matrix on a given parallel platform. Our numerical experiments confirm that the methods behave so differently depending on the matrix structure that identifying general rules to select the optimal method for a given matrix becomes extremely difficult, though some useful strategies (heuristics) can be defined. Using a machine learning approach, we show that it is possible to obtain inexpensive classifiers that predict the best method for a given sparse matrix with over 80% accuracy, demonstrating that this approach can deliver important reductions in both execution time and energy consumption.
