Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Williams, Samuel; Oliker, Leonid; Vuduc, Richard; Shalf, John; Yelick, Katherine; Demmel, James

doi:10.1145/1362622.1362674

Cited by 416 publications

(297 citation statements)

References 16 publications

Supporting

Mentioning

285

Contrasting

Unclassified

Order By: Relevance

“…On co-processors composed of a large amount of lightweight single instruction, multiple data (SIMD) units, the problem can heavily degrade performance of SpMV operation. Even though many strategies, such as vectorization [1,2,13], data streaming [14], memory coalescing [33], static or dynamic binning [14,15], Dynamic Parallelism [15] and dynamic row distribution [19], have been proposed for the row block method, it is still impossible to achieve nearly perfect load balancing in general sense, simply since row sizes are irregular and unpredictable.…”

Section: Csr Format and Csr-based Spmv Algorithmsmentioning

confidence: 99%

“…Thereofore, improving performance of SpMV using the most widely supported CSR format has also gained plenty of attention [1,2,13,14,15,16,17,18]. Most of the related work [1,2,13,14,15,19] has focused on improving row block method for the CSR-based SpMV.…”

Section: Introductionmentioning

confidence: 99%

“…Most of the related work [1,2,13,14,15,19] has focused on improving row block method for the CSR-based SpMV. However, these newly proposed approaches are not highly efficient.…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors

Liu

Vinter

2015

Parallel Computing

View full text Add to dashboard Cite

Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU-GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect results. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms.

show abstract

Section: Csr Format and Csr-based Spmv Algorithmsmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors

Liu

Vinter

2015

Parallel Computing

View full text Add to dashboard Cite

show abstract

“…The resulting GPU algorithm has been tested with the sparse matrix set used in studies of multi-core [6] and GPU [7] matrix-vector product performance. Figure 5 shows that the convergence of the default SPAI (with sparsity pattern identical to A T ) is highly competitive with the default CUSP-Bridson preconditioner for the GMRES linear solver.…”

Section: Figmentioning

confidence: 99%

SPAI Preconditioners for HPC Applications

Sawyer

Vanini

Fourestey

et al. 2012

Proc Appl Math and Mech

View full text Add to dashboard Cite

Iterative methods to solve systems of linear equations Ax = b usually require preconditioners M to speed convergence. But the calculation of many preconditioners can be notoriously sequential. The sparse approximate inverse preconditioner (SPAI) has particular potential for high performance computing [1]. We have ported the SPAI algorithm to graphical processing units (GPUs) within NVIDIA's CUSP library [2] for sparse linear algebra. GPUs perform well on dense linear algebra problems where data resides for long periods on the device. Since the underlying minimization problems are independent, they are mapped to separate thread-blocks, and an optimized QR algorithm implemented using NVIDIA's CUBLAS library is employed on each.Traditionally the challenge has been to determine a sparsity pattern Sp(M) of the preconditioner dynamically [3], which reduces the condition number of MA to a degree where a preconditioned iterative solver such as GMRES becomes computationally viable. Due to the extremely high performance of the GPU, it is possible to consider initial sparsity patterns much denser than have been previously considered. We therefore consider a fixed sparsity patterns to simplify the GPU implementation.We evaluate the performance of the resulting preconditioner on a standard set of sparse matrices, and compare SPAI to other preconditioners. SPAI OverviewGiven a large, sparse system of equations Ax = b, the sparse approximate inverse technique [3] minimizes the Frobenius norm,which entails solving n decoupled least squares minimization problems, min m k Am k − e k , k = 1, . . . , n. We assume that M has some initial sparsity pattern, e.g., the same sparsity as A. Thus, for each k let J be the set of indices j for which m k (j) = 0, and denote that vector with its compressed representation,m k = m k (J ). Let I be the set of indices i such that A(i, J ) = 0, andÂ = A(I, J ). Finally, defineê k = e k (I). The minimization then becomes the dense problem:The solution to this minimization ism k = R −1 Q Tê k , withÂ = QR, the QR decomposition. The decoupled nature of the minimization problems ensures that all solutionsm k can be found in parallel. Implementation on Graphical Processing UnitsSPAI's underlying numerical kernel is the QR algorithm [4]. Empirical analysis of many sparse matrices indicates that the minimization matricesÂ tend to be "tall and skinny", with rarely more than 200 rows. While there are several GPU QR implementations, e.g., in the MAGMA [5] library, these are optimized for large matrices, not for many independent small matrices. In the former case, the GPU should be applied to the entire matrix. In our case, each small minimization should be mapped to a thread-block.To this end, we have implemented several versions of QR for a single thread-block. Figure 1 clearly indicates that versions which are based on CUSP's own dense linear algebra routines are much less performant than those based on NVIDIA's CUBLAS library, and Householder QR is more effective than Givens QR. Figure 2 indicates that ...

show abstract

“…Although platform specific tuning is known to give significant efficiency improvements (see the study of Williams et al [15]), we chose not to apply it here. In this way we keep RSB algorithms general and the code portable, thus retaining the possibility of further optimizations.…”

Section: Introduction and Related Literaturementioning

confidence: 99%

Efficient multithreaded untransposed, transposed or symmetric sparse matrix–vector multiplication with the Recursive Sparse Blocks format

Martone

2014

Parallel Computing

View full text Add to dashboard Cite

In earlier work we have introduced the "Recursive Sparse Blocks" (RSB) sparse matrix storage scheme oriented towards cache efficient matrix-vector multiplication (SpMV ) and triangular solution (SpSV ) on cache based shared memory parallel computers. Both the transposed (SpMV T ) and symmetric (SymSpMV ) matrix-vector multiply variants are supported. RSB stands for a meta-format: it recursively partitions a rectangular sparse matrix in quadrants; leaf submatrices are stored in an appropriate traditional formateither Compressed Sparse Rows (CSR) or Coordinate (COO). In this work, we compare the performance of our RSB implementation of SpMV, SpMV T, SymSpMV to that of the state-of-the-art Intel Math Kernel Library (MKL) CSR implementation on the recent Intel's Sandy Bridge processor. Our results with a few dozens of real world large matrices suggest the efficiency of the approach: in all of the cases, RSB's SymSpMV (and in most cases, SpMV T as well) took less than half of MKL CSR's time; SpMV 's advantage was smaller. Furthermore, RSB's SpMV T is more scalable than MKL's CSR, in that it performs almost as well as SpMV. Additionally, we include comparisons to the state-of-the art format Compressed Sparse Blocks (CSB) implementation. We observed RSB to be slightly superior to CSB in SpMV T, slightly inferior in SpMV, and better (in most cases by a factor of two or more) in SymSpMV. Although RSB is a non-traditional storage format and thus needs a special constructor, it can be assembled from CSR or any other similar rowordered representation arrays in the time of a few dozens of matrix-vector multiply executions. Thanks to its significant advantage over MKL's CSR routines for symmetric or transposed matrix-vector multiplication, in most of the observed cases the assembly cost has been observed to amortize with fewer than fifty iterations.

show abstract

Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Cited by 416 publications

References 16 publications

Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors

Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors

SPAI Preconditioners for HPC Applications

Efficient multithreaded untransposed, transposed or symmetric sparse matrix–vector multiplication with the Recursive Sparse Blocks format

Contact Info

Product

Resources

About