2015
DOI: 10.1016/j.jpdc.2015.06.010

A framework for general sparse matrix–matrix multiplication on GPUs and heterogeneous processors

Highlights:
• We design a framework for SpGEMM on modern manycore processors using the CSR format.
• We present a hybrid method for pre-allocating the resulting sparse matrix.
• We propose an efficient parallel insert method for long rows of the resulting matrix.
• We develop a heuristic-based load balancing strategy.
• Our approach significantly outperforms other known CPU and GPU SpGEMM methods.

Abstract: General sparse matrix–matrix multiplication (SpGEMM) is a fundamental building block for numero…
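The highlights outline a CSR-based, row-wise SpGEMM framework. As a point of reference for what that computation looks like, here is a minimal sequential Gustavson-style CSR SpGEMM sketch; all type and function names are illustrative, and the paper's actual contributions (GPU parallelization, result pre-allocation, load balancing) are layered on top of this baseline:

```cpp
#include <vector>

// Minimal CSR container; field names are illustrative, not from the paper.
struct Csr {
    int rows = 0, cols = 0;
    std::vector<int> rowPtr, colIdx;
    std::vector<double> vals;
};

// Gustavson's row-wise SpGEMM: C = A * B with one dense accumulator,
// reused across rows via a per-row marker.
Csr spgemm(const Csr& A, const Csr& B) {
    Csr C;
    C.rows = A.rows; C.cols = B.cols;
    C.rowPtr.assign(A.rows + 1, 0);
    std::vector<double> acc(B.cols, 0.0);  // dense accumulator
    std::vector<int> marker(B.cols, -1);   // last row that touched column j
    for (int i = 0; i < A.rows; ++i) {
        std::vector<int> live;             // columns active in row i of C
        for (int ap = A.rowPtr[i]; ap < A.rowPtr[i + 1]; ++ap) {
            int k = A.colIdx[ap];
            double aik = A.vals[ap];
            for (int bp = B.rowPtr[k]; bp < B.rowPtr[k + 1]; ++bp) {
                int j = B.colIdx[bp];
                if (marker[j] != i) { marker[j] = i; acc[j] = 0.0; live.push_back(j); }
                acc[j] += aik * B.vals[bp];
            }
        }
        for (int j : live) { C.colIdx.push_back(j); C.vals.push_back(acc[j]); }
        C.rowPtr[i + 1] = (int)C.colIdx.size();
    }
    return C;
}
```

For brevity the sketch leaves column indices within each output row unsorted; production implementations typically sort or merge them.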

Cited by 89 publications (72 citation statements)
References 49 publications
“…Notable applications using such operations today include, for example, machine learning [14], weather pattern analysis [15], and shortest-path problems [16]. 2) Different operations have different compute intensities: higher for matrix multiplication, lower for vector addition, etc.…”
Section: B. Tests and Results
confidence: 99%
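As a rough, roofline-style illustration of why those intensities differ (standard back-of-envelope arithmetic, not taken from the cited works): dense matrix multiplication reuses each operand entry many times, whereas vector addition touches each word only once:

```latex
% Arithmetic intensity I = flops / words moved (order-of-magnitude sketch).
% Dense n x n matrix multiplication: 2n^3 flops over ~3n^2 words.
% Vector addition z = x + y: n flops over ~3n words.
\[
I_{\mathrm{GEMM}} = \frac{2n^{3}}{3n^{2}} = \tfrac{2}{3}\,n
\quad\text{(grows with $n$, compute-bound)},
\qquad
I_{\mathrm{vec\,add}} = \frac{n}{3n} = \tfrac{1}{3}
\quad\text{(constant, memory-bound)}.
\]
```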
“…We use NVIDIA K40m (Kepler) and Titan X (Pascal) GPUs for comparing the performance of our algorithm and several existing methods (CUSP [1], cuSPARSE, bhSPARSE [5] and RMerge [3]) that compute C = A² in double precision. The CUDA versions are 7.0 and 8.0 on the K40m and Titan X, respectively.…”
Section: Performance Evaluation and Conclusion
confidence: 99%
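For context on the comparison baselines, squaring a matrix with the CUSP library can be sketched as below (compiled with nvcc against the CUSP headers); the 5-point Poisson test matrix is an illustrative choice, not one of the benchmark inputs from the quoted paper:

```cpp
#include <cusp/csr_matrix.h>
#include <cusp/multiply.h>
#include <cusp/gallery/poisson.h>

int main() {
    // Illustrative input: 5-point Poisson matrix on a 256 x 256 grid.
    cusp::csr_matrix<int, double, cusp::device_memory> A;
    cusp::gallery::poisson5pt(A, 256, 256);

    // C = A * A, i.e. A^2, in double precision on the GPU.
    cusp::csr_matrix<int, double, cusp::device_memory> C;
    cusp::multiply(A, A, C);
    return 0;
}
```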
“…The biggest challenges are (i) that the structure of the resulting matrix depends on the input matrices, (ii) that the organization of the entries in the resulting matrix requires communication between threads, and (iii) that the number of operations carried out by individual threads may vary strongly. To provide an efficient implementation we take advantage of the algorithmic description of bhSPARSE [LV15]. We tackle the aforementioned issues in a four-stage approach: In the first stage, we compute an upper bound for the number of nonzeros in each column of the result matrix, which allows for allocating sufficient storage.…”
Section: Parallel GPU Implementation
confidence: 99%
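The first of those four stages, bounding the nonzero count of the result, has a simple sequential analogue for CSR inputs (the quoted work bounds columns; the row-wise version is sketched here with illustrative names):

```cpp
#include <vector>

// Upper bound on nnz per row of C = A * B for CSR inputs: row i of C
// can hold at most the sum of nnz over the B-rows indexed by the column
// indices in row i of A (duplicate column hits are not deduplicated).
std::vector<int> nnzUpperBound(const std::vector<int>& aRowPtr,
                               const std::vector<int>& aColIdx,
                               const std::vector<int>& bRowPtr) {
    int rows = (int)aRowPtr.size() - 1;
    std::vector<int> ub(rows, 0);
    for (int i = 0; i < rows; ++i)
        for (int p = aRowPtr[i]; p < aRowPtr[i + 1]; ++p) {
            int k = aColIdx[p];
            ub[i] += bRowPtr[k + 1] - bRowPtr[k];
        }
    return ub;
}
```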
“…Note that Mv implies summation along the compressed direction. For reference, we report the timings for cuSPARSE [NVI15] and bhSPARSE [LV15].…”
confidence: 99%
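A minimal CSR SpMV sketch makes "summation along the compressed direction" concrete: in CSR the compressed direction is the row, so each output entry accumulates over one row's stored entries (illustrative code, not from either cited library):

```cpp
#include <vector>

// CSR SpMV y = M * v: each y[i] sums the products of row i's stored
// values with the gathered entries of v, i.e. the reduction runs
// along the compressed (row) direction.
std::vector<double> spmv(const std::vector<int>& rowPtr,
                         const std::vector<int>& colIdx,
                         const std::vector<double>& vals,
                         const std::vector<double>& v) {
    int rows = (int)rowPtr.size() - 1;
    std::vector<double> y(rows, 0.0);
    for (int i = 0; i < rows; ++i)
        for (int p = rowPtr[i]; p < rowPtr[i + 1]; ++p)
            y[i] += vals[p] * v[colIdx[p]];
    return y;
}
```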