2016
DOI: 10.1016/j.jpdc.2016.03.011

Optimization techniques for sparse matrix–vector multiplication on GPUs

Abstract: Sparse linear algebra is fundamental to numerous areas of applied mathematics, science and engineering. In this paper, we propose an efficient data structure named AdELL+ for optimizing the SpMV kernel on GPUs, focusing on performance bottlenecks of sparse computation. The foundation of our work is an ELL-based adaptive format which copes with matrix irregularity using balanced warps composed using a parametrized warp-balancing heuristic. We also address the intrinsic bandwidth-limited nature of SpMV with warp…
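The ELL (ELLPACK) layout underlying AdELL+ pads every row to a fixed width so that rows can be processed in lockstep by a warp. The following is a minimal Python sketch of plain ELL-format SpMV, written for illustration only; it is not the authors' AdELL+ kernel and omits the adaptive warp-balancing described in the paper.

```python
import numpy as np

def to_ell(dense):
    """Convert a dense matrix to ELL format: values and column indices
    padded to the maximum number of nonzeros found in any row."""
    rows, _ = dense.shape
    width = int((dense != 0).sum(axis=1).max())  # padded row width
    values = np.zeros((rows, width))
    cols = np.zeros((rows, width), dtype=int)
    for i in range(rows):
        nz = np.nonzero(dense[i])[0]
        values[i, :len(nz)] = dense[i, nz]
        cols[i, :len(nz)] = nz
    return values, cols

def ell_spmv(values, cols, x):
    """Compute y = A @ x from the padded ELL representation.
    Padding slots hold value 0, so they contribute nothing to the sum."""
    return (values * x[cols]).sum(axis=1)

A = np.array([[4.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [3.0, 0.0, 5.0]])
x = np.array([1.0, 2.0, 3.0])
vals, cols = to_ell(A)
print(ell_spmv(vals, cols, x))  # equals A @ x, i.e. [7, 4, 18]
```

Because every row is padded to the widest row, plain ELL wastes memory and work on irregular matrices; this is exactly the imbalance the paper's adaptive format is designed to mitigate.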

Cited by 18 publications (7 citation statements)
References 35 publications
“…Finally, we compare the performance of HYB-SCF against HYB-MAGGIONI. Table X shows the approximate speed-up achieved by HYB-MAGGIONI over CPU-CPLEX for each test problem, as reported by Maggioni [45]. It is evident from these speed-up measurements that HYB-MAGGIONI outperformed CPU-CPLEX for only three test problems: RAIL507, RAIL2586 and KARTED.…”
Section: B. HYB-SCF vs. CPU-CPLEX and HYB-MAGGIONI
Confidence: 87%
“…In this work, we use the systolic array architecture for the Gramian matrix computation. A systolic array architecture is produced by interconnecting a set of attached data processing units (DPUs) in a regular pattern [32], [33]. In parallel, each unit or cell receives data from its upstream neighbors to compute a part of the result.…”
Section: 3) Systolic Array Architecture for Gramian Matrix Computation
Confidence: 99%
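The data flow this excerpt describes, each cell receiving operands from its upstream neighbor with a one-cycle delay per hop, can be illustrated with a toy software simulation. The sketch below is our own illustration of a 1-D systolic array computing a matrix-vector product, not code from the cited work; the one-step skew per cell models the pipelined propagation of inputs through the array.

```python
import numpy as np

def systolic_matvec(A, x):
    """Simulate a linear systolic array computing y = A @ x.

    Cell i holds row i of A. The vector x streams through the array,
    so cell i sees element x[t] at global step t + i (one-cycle delay
    per hop). Each cell performs one multiply-accumulate per step;
    the full product drains after rows + cols - 1 steps.
    """
    rows, cols = A.shape
    acc = np.zeros(rows)                 # per-cell partial results
    for step in range(rows + cols - 1):  # global clock
        for i in range(rows):            # every cell works in parallel
            t = step - i                 # which x element reaches cell i now
            if 0 <= t < cols:
                acc[i] += A[i, t] * x[t]
    return acc
```

A direct check such as `np.allclose(systolic_matvec(A, x), A @ x)` confirms the skewed schedule produces the same result as an ordinary matrix-vector product; the point of the hardware version is that all cells fire concurrently on local data.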
“…For example, the performance of sparse matrix–vector multiplication (SpMV) on GPUs depends strongly on the input sparse matrix (Bell and Garland, 2008). Many studies have shown the benefits of auto-tuning for SpMV (Reguly and Giles, 2012; Ashari et al., 2014; Liu and Vinter, 2015; Maggioni and Berger-Wolf, 2016). In astrophysics, Ishiyama et al. (2009, 2012) achieved a good load balance for their massively parallel TreePM code by incorporating on-the-fly measurements of the execution time of each function within the simulation.…”
Section: Introduction
Confidence: 99%