SC14: International Conference for High Performance Computing, Networking, Storage and Analysis 2014
DOI: 10.1109/sc.2014.69
Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications

Cited by 123 publications (90 citation statements)
References 10 publications
“…Their main bottlenecks were the limited size of shared memory, an expensive global scan operation, and random non-coalesced memory accesses. Patidar [29] proposed two methods with a particular focus on a large number of buckets (more than 4k): one based on heavy use of shared-memory atomic operations (to compute block-level histograms and intra-bucket orders), and the other on iterative use of a basic binary split for each bucket (or group of buckets). Patidar combined these methods hierarchically to get his best results.…”
Section: Multisplit and Histograms (mentioning)
confidence: 99%
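For context, the block-level histogram step described in this excerpt can be sketched in CUDA as follows. This is a minimal illustration of shared-memory atomic binning, assuming a fixed NUM_BUCKETS and a trivial bucket function; it is not Patidar's actual implementation.

#define NUM_BUCKETS 256

__global__ void blockHistogram(const unsigned int *keys, int n,
                               unsigned int *blockHist)  // gridDim.x * NUM_BUCKETS entries
{
    __shared__ unsigned int hist[NUM_BUCKETS];

    // Zero the shared-memory histogram cooperatively.
    for (int b = threadIdx.x; b < NUM_BUCKETS; b += blockDim.x)
        hist[b] = 0;
    __syncthreads();

    // Each thread bins its elements; contention stays in fast
    // shared-memory atomics rather than global ones.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        unsigned int bucket = keys[i] % NUM_BUCKETS;  // illustrative bucket function
        atomicAdd(&hist[bucket], 1u);
    }
    __syncthreads();

    // Publish per-block counts; a subsequent scan over blockHist would
    // turn them into scatter offsets (the expensive global step noted above).
    for (int b = threadIdx.x; b < NUM_BUCKETS; b += blockDim.x)
        blockHist[blockIdx.x * NUM_BUCKETS + b] = hist[b];
}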
“…On co-processors composed of a large number of lightweight single-instruction, multiple-data (SIMD) units, this problem can heavily degrade the performance of the SpMV operation. Even though many strategies, such as vectorization [1,2,13], data streaming [14], memory coalescing [33], static or dynamic binning [14,15], Dynamic Parallelism [15], and dynamic row distribution [19], have been proposed for the row block method, it is still impossible to achieve nearly perfect load balancing in the general case, simply because row sizes are irregular and unpredictable.…”
Section: CSR Format and CSR-based SpMV Algorithms (mentioning)
confidence: 99%
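The row block method referred to here assigns one thread per CSR row, which is where the imbalance originates: each thread's loop trip count equals its row length. A minimal sketch of this standard scalar CSR SpMV kernel (generic, not any specific cited implementation):

__global__ void spmvCsrScalar(int nRows,
                              const int *rowPtr, const int *colIdx,
                              const double *vals, const double *x,
                              double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nRows) {
        double sum = 0.0;
        // Trip count = row length; one long row stalls its whole warp,
        // which is the load-imbalance problem described above.
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += vals[j] * x[colIdx[j]];
        y[row] = sum;
    }
}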
“…Therefore, improving the performance of SpMV using the most widely supported CSR format has also gained plenty of attention [1,2,13,14,15,16,17,18]. Most of the related work [1,2,13,14,15,19] has focused on improving the row block method for CSR-based SpMV.…”
Section: Introduction (mentioning)
confidence: 99%
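Among the strategies listed in the previous excerpt, vectorization is the most common refinement of the row block method: a warp cooperates on each row and reduces its partial sums with shuffles. A minimal, generic CUDA sketch, assuming a warp size of 32 and not tied to any specific cited implementation:

__global__ void spmvCsrVector(int nRows,
                              const int *rowPtr, const int *colIdx,
                              const double *vals, const double *x,
                              double *y)
{
    int lane = threadIdx.x & 31;
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;  // one warp per row
    if (row < nRows) {
        double sum = 0.0;
        // Lanes stride across the row, giving coalesced access to vals/colIdx.
        for (int j = rowPtr[row] + lane; j < rowPtr[row + 1]; j += 32)
            sum += vals[j] * x[colIdx[j]];
        // Warp-level tree reduction of the partial sums.
        for (int off = 16; off > 0; off >>= 1)
            sum += __shfl_down_sync(0xffffffffu, sum, off);
        if (lane == 0)
            y[row] = sum;
    }
}

This mitigates but does not remove the imbalance: rows shorter than a warp leave lanes idle, which is why the binning and Dynamic Parallelism strategies mentioned above exist.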
“…The SELL-C-σ format has been improved and optimized for GPUs by Anzt et al. [3] by introducing some zero padding to satisfy the memory constraints of the GPU architecture; the result is called the SELL-P format. Ashari et al. [4] proposed an adaptive algorithm for SpMV using the CSR format (called ACSR), in which additional metadata are used with the standard CSR format to help achieve better GPU utilization. ACSR is mainly proposed for adaptive graph applications, where the structure of the graph adjacency matrix changes frequently, making a heavy preprocessing step a serious bottleneck.…”
Section: Related Work (mentioning)
confidence: 99%
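As a rough illustration of the kind of metadata ACSR [4] attaches to an unmodified CSR matrix, the host-side sketch below groups rows into power-of-two bins by row length, so that each bin can later be serviced by an appropriately sized kernel. The bin boundaries, names, and layout are illustrative assumptions, not the paper's actual data structures; the cost of exactly this kind of pass is the preprocessing bottleneck the excerpt mentions.

#include <vector>

struct RowBins {
    // bins[k] holds the ids of rows whose nnz count is at most 2^k
    // (and greater than 2^(k-1) for k > 0).
    std::vector<std::vector<int>> bins;
};

RowBins binRowsByLength(const std::vector<int> &rowPtr, int maxBins = 32)
{
    RowBins rb;
    rb.bins.resize(maxBins);
    for (int row = 0; row + 1 < (int)rowPtr.size(); ++row) {
        int nnz = rowPtr[row + 1] - rowPtr[row];
        int k = 0;
        while ((1 << k) < nnz && k + 1 < maxBins)  // smallest k with 2^k >= nnz
            ++k;
        rb.bins[k].push_back(row);
    }
    return rb;
}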