2001
DOI: 10.1016/s0167-8191(01)00073-4

Towards a fast parallel sparse symmetric matrix–vector multiplication

Cited by 38 publications (37 citation statements)
References 5 publications

“…Bell and Garland consider several methods, including a variation of ELLPACK that differs from ours [3]. They split the storage between an ELLPACK and a coordinate format to reduce its footprint, a novel variant of other previously proposed splitting methods [9,13,20]. At the same time, Baskaran and Bordawekar proposed a general compile- and run-time infrastructure, evaluated for SpMV [2].…”
Section: Related Research
confidence: 99%
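
To make the split concrete, here is a minimal C sketch of such a hybrid ELL/COO format. It illustrates the general technique the excerpt describes, not Bell and Garland's actual (GPU-oriented) implementation, and every struct, field, and function name is invented for this example.

#include <stddef.h>

/* Hypothetical hybrid storage: each row's first `ell_width` nonzeros live
 * in padded, column-major ELLPACK arrays; overflow entries go into a
 * coordinate (COO) list. */
typedef struct {
    size_t n_rows;
    size_t ell_width;      /* nonzeros per row kept in the ELL part      */
    const int *ell_col;    /* n_rows * ell_width, column-major, -1 = pad */
    const double *ell_val; /* n_rows * ell_width, column-major           */
    size_t coo_nnz;        /* number of overflow entries                 */
    const int *coo_row;
    const int *coo_col;
    const double *coo_val;
} HybridMatrix;

/* y = A * x for the hybrid format. */
void hybrid_spmv(const HybridMatrix *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->n_rows; ++i)
        y[i] = 0.0;

    /* ELL part: column-major layout gives unit-stride access per slot. */
    for (size_t k = 0; k < A->ell_width; ++k) {
        const int *col = A->ell_col + k * A->n_rows;
        const double *val = A->ell_val + k * A->n_rows;
        for (size_t i = 0; i < A->n_rows; ++i)
            if (col[i] >= 0)               /* skip padding entries */
                y[i] += val[i] * x[col[i]];
    }

    /* COO part absorbs the few rows that overflow the ELL width. */
    for (size_t e = 0; e < A->coo_nnz; ++e)
        y[A->coo_row[e]] += A->coo_val[e] * x[A->coo_col[e]];
}

The point of the split is that the regular ELL part covers the bulk of the nonzeros with predictable, padded storage, while the COO list absorbs the few exceptionally long rows that would otherwise force heavy padding; that is the footprint reduction the excerpt refers to.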
“…Furthermore, storage of the entire H is not necessary, which significantly reduces the storage requirements. The sparsity and highly structured form of the A matrices would suggest the use of sparse iterative techniques, yet research has shown the A matrix to be a worst-case scenario in many respects, one that elicits poor performance from all standard iterative methods [26]–[28]. To this end, the algorithms developed in ScalIT address these shortcomings and have proven to be effective and highly parallelizable.…”
Section: The "A Matrix" Form and ScalIT Methodology
confidence: 99%
“…Other tuning techniques include diagonal cache blocking [30], the detection of diagonal substructures [11], the exploitation of symmetries [21], and optimizations for specific higher-level kernels, such as sparse triangular solve [34]. The impact of prefetching on SMVM performance was previously explored in [31] and [37].…”
Section: Related Work
confidence: 99%
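
Since both the indexed paper and the excerpt's "exploitation of symmetries [21]" concern symmetric SpMV, a short sketch of the basic idea may help: store only the upper triangle and apply each off-diagonal entry twice. This is a generic serial illustration in the same assumed C style as above, not the method of any cited paper; all names are hypothetical.

#include <stddef.h>

/* Symmetric SpMV with only the upper triangle (diagonal included) stored
 * in CSR, roughly halving the memory footprint and traffic. Each
 * off-diagonal entry a_ij contributes to both y_i and y_j. */
void sym_spmv_csr_upper(size_t n,
                        const size_t *row_ptr, /* n + 1 entries          */
                        const int *col_idx,    /* col_idx[k] >= row i    */
                        const double *val,
                        const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = 0.0;

    for (size_t i = 0; i < n; ++i) {
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            size_t j = (size_t)col_idx[k];
            y[i] += val[k] * x[j];
            if (j != i)                /* mirror the off-diagonal entry */
                y[j] += val[k] * x[i];
        }
    }
}

The scattered updates to y[j] are what make this kernel awkward to parallelize, since two rows can write to the same output entry; handling that write conflict efficiently is a central concern of work on fast parallel symmetric SpMV.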