2016
DOI: 10.1016/j.jpdc.2016.03.011

Optimization techniques for sparse matrix–vector multiplication on GPUs

Abstract: Sparse linear algebra is fundamental to numerous areas of applied mathematics, science and engineering. In this paper, we propose an efficient data structure named AdELL+ for optimizing the SpMV kernel on GPUs, focusing on performance bottlenecks of sparse computation. The foundation of our work is an ELL-based adaptive format which copes with matrix irregularity using balanced warps composed using a parametrized warp-balancing heuristic. We also address the intrinsic bandwidth-limited nature of SpMV with warp…
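The ELL (ELLPACK) layout underlying AdELL+ pads every row to a fixed width so that rows can be processed in lockstep by a warp. The following is a minimal Python sketch of plain ELL-format SpMV, written for illustration only; it is not the authors' AdELL+ kernel and omits the adaptive warp-balancing described in the paper.

```python
import numpy as np

def to_ell(dense):
    """Convert a dense matrix to ELL format: values and column indices
    padded to the maximum number of nonzeros found in any row."""
    rows, _ = dense.shape
    width = int((dense != 0).sum(axis=1).max())  # padded row width
    values = np.zeros((rows, width))
    cols = np.zeros((rows, width), dtype=int)
    for i in range(rows):
        nz = np.nonzero(dense[i])[0]
        values[i, :len(nz)] = dense[i, nz]
        cols[i, :len(nz)] = nz
    return values, cols

def ell_spmv(values, cols, x):
    """Compute y = A @ x from the padded ELL representation.
    Padding slots hold value 0, so they contribute nothing to the sum."""
    return (values * x[cols]).sum(axis=1)

A = np.array([[4.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [3.0, 0.0, 5.0]])
x = np.array([1.0, 2.0, 3.0])
vals, cols = to_ell(A)
print(ell_spmv(vals, cols, x))  # equals A @ x, i.e. [7, 4, 18]
```

Because every row is padded to the widest row, plain ELL wastes memory and work on irregular matrices; this is exactly the imbalance the paper's adaptive format is designed to mitigate.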

Cited by 18 publications (7 citation statements)
References 35 publications
“…Finally, we compare the performance of HYB-SCF against HYB-MAGGIONI. Table X shows the approximate speed-up achieved by HYB-MAGGIONI over CPU-CPLEX for each test problem, as reported by Maggioni [45]. It is evident from these speed-up measurements that HYB-MAGGIONI outperformed CPU-CPLEX for only three test problems: RAIL507, RAIL2586 and KARTED.…”
Section: B. HYB-SCF vs. CPU-CPLEX and HYB-MAGGIONI
Confidence: 87%
“…In this work, we use the systolic array architecture for the Gramian matrix computation. A systolic array architecture is produced by interconnecting a set of attached data processing units (DPUs) in a regular pattern [32], [33]. In parallel, each unit or cell receives data from its upstream neighbors to compute a part of the result.…”
Section: 3) Systolic Array Architecture for Gramian Matrix Computation
Confidence: 99%
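The data flow this excerpt describes, each cell receiving operands from its upstream neighbor with a one-cycle delay per hop, can be illustrated with a toy software simulation. The sketch below is our own illustration of a 1-D systolic array computing a matrix-vector product, not code from the cited work; the one-step skew per cell models the pipelined propagation of inputs through the array.

```python
import numpy as np

def systolic_matvec(A, x):
    """Simulate a linear systolic array computing y = A @ x.

    Cell i holds row i of A. The vector x streams through the array,
    so cell i sees element x[t] at global step t + i (one-cycle delay
    per hop). Each cell performs one multiply-accumulate per step;
    the full product drains after rows + cols - 1 steps.
    """
    rows, cols = A.shape
    acc = np.zeros(rows)                 # per-cell partial results
    for step in range(rows + cols - 1):  # global clock
        for i in range(rows):            # every cell works in parallel
            t = step - i                 # which x element reaches cell i now
            if 0 <= t < cols:
                acc[i] += A[i, t] * x[t]
    return acc
```

A direct check such as `np.allclose(systolic_matvec(A, x), A @ x)` confirms the skewed schedule produces the same result as an ordinary matrix-vector product; the point of the hardware version is that all cells fire concurrently on local data.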
“…For example, the performance of sparse matrix–vector multiplication (SpMV) on GPUs depends strongly on the input sparse matrix (Bell and Garland, 2008). Many studies have shown the benefits of auto-tuning for SpMV (Reguly and Giles, 2012; Ashari et al., 2014; Liu and Vinter, 2015; Maggioni and Berger-Wolf, 2016). In astrophysics, Ishiyama et al. (2009, 2012) achieved a good load balance for their massively parallel TreePM code by incorporating on-the-fly measurements of the execution time of each function within the simulation.…”
Section: Introduction
Confidence: 99%