Two-dimensional cache-oblivious sparse matrix–vector multiplication

Yzelman, Albert-Jan N.; Bisseling, Rob H.

doi:10.1016/j.parco.2011.08.004

Cited by 31 publications

(23 citation statements)

References 17 publications


“…Applying partitioning to minimise communication between computing cores is not enough, as data access patterns of the input vector are not improved while bandwidth becomes more limited as more cores are involved in the computation. Future work should be directed towards combining communication minimisation with methods to enhance cache use, for example by permuting of the local input matrix representations, by adapting the sparse matrix storage scheme or both.…”

Section: Discussion (mentioning)

confidence: 99%

“…Because these factors are not constant, the BSP model does not make very accurate predictions on the run-[…] more limited as more cores are involved in the computation. Future work should be directed towards combining communication minimisation with methods to enhance cache use, for example by permuting of the local input matrix representations [18,21], by adapting the sparse matrix storage scheme [27][28][29] or both.…”

mentioning

confidence: 99%


An object‐oriented bulk synchronous parallel library for multicore programming

Yzelman

Bisseling

2011

Concurrency and Computation

Self Cite


Summary: We show that the bulk synchronous parallel (BSP) model, originally designed for distributed-memory systems, is also applicable for shared-memory multicore systems and, furthermore, that BSP libraries are useful in scientific computing on these systems. A proof-of-concept MulticoreBSP library has been implemented in Java, and is used to show that BSP algorithms can attain proper speedups on multicore architectures. This library is based on the BSPlib implementation, adapted to an object-oriented setting. In comparison, the number of function primitives is reduced, while the overall design simplicity is improved. We detail applying the BSP model and library on the sparse matrix-vector (SpMV) multiplication problem, and show by performing numerical experiments that the resulting BSP SpMV algorithm attains speedups, in one case reaching a speedup of 3.5 for 4 threads. Whereas not described in detail in this paper, algorithms for the fast Fourier transform and the dense LU decomposition are also investigated; in one case, attaining superlinear speedups of 5 for 4 threads. The predictability of BSP algorithms in the case of the SpMV is also investigated.


“…1 for a visual example of it. However, unlike CSB (Buluç et al [8]) the sparse blocks dimensions are not uniform, and unlike Yzelman and Bisseling's ([9]) our techniques are not hyper-graph based. Similarly to other approaches, selection of a data structure for blocks occurs, but without using completely novel formats, as Kourtis et al [10] do with CSX or as Belgin et al [11] do with PBR.…”

Section: Introduction and Related Literature (mentioning)

confidence: 94%

Efficient multithreaded untransposed, transposed or symmetric sparse matrix–vector multiplication with the Recursive Sparse Blocks format

Martone

2014

Parallel Computing


In earlier work we have introduced the "Recursive Sparse Blocks" (RSB) sparse matrix storage scheme oriented towards cache efficient matrix-vector multiplication (SpMV) and triangular solution (SpSV) on cache based shared memory parallel computers. Both the transposed (SpMV T) and symmetric (SymSpMV) matrix-vector multiply variants are supported. RSB stands for a meta-format: it recursively partitions a rectangular sparse matrix in quadrants; leaf submatrices are stored in an appropriate traditional format, either Compressed Sparse Rows (CSR) or Coordinate (COO). In this work, we compare the performance of our RSB implementation of SpMV, SpMV T, SymSpMV to that of the state-of-the-art Intel Math Kernel Library (MKL) CSR implementation on Intel's recent Sandy Bridge processor. Our results with a few dozen real world large matrices suggest the efficiency of the approach: in all of the cases, RSB's SymSpMV (and in most cases, SpMV T as well) took less than half of MKL CSR's time; SpMV's advantage was smaller. Furthermore, RSB's SpMV T is more scalable than MKL's CSR, in that it performs almost as well as SpMV. Additionally, we include comparisons to the state-of-the-art format Compressed Sparse Blocks (CSB) implementation. We observed RSB to be slightly superior to CSB in SpMV T, slightly inferior in SpMV, and better (in most cases by a factor of two or more) in SymSpMV. Although RSB is a non-traditional storage format and thus needs a special constructor, it can be assembled from CSR or any other similar row-ordered representation arrays in the time of a few dozen matrix-vector multiply executions. Thanks to its significant advantage over MKL's CSR routines for symmetric or transposed matrix-vector multiplication, in most of the observed cases the assembly cost has been observed to amortize with fewer than fifty iterations.
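The recursive quadrant partitioning described in this abstract can be illustrated with a minimal Python sketch. It is not the RSB implementation: the function names, the `leaf_nnz` threshold, and the choice to store every leaf as a plain COO list are assumptions made for illustration only.

```python
def build_quadtree(entries, r0, r1, c0, c1, leaf_nnz=2):
    """Recursively split the block [r0, r1) x [c0, c1) into four quadrants.

    entries is a list of (row, col, value) triplets inside the block.
    A block becomes a leaf (a plain COO list) once it holds at most
    leaf_nnz nonzeros or can no longer be split. Both the threshold and
    the leaf format are illustrative assumptions.
    """
    if len(entries) <= leaf_nnz or (r1 - r0 <= 1 and c1 - c0 <= 1):
        return entries
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    quads = [[], [], [], []]
    for i, j, v in entries:
        # Quadrant index: 0 = top-left, 1 = top-right,
        # 2 = bottom-left, 3 = bottom-right.
        quads[2 * int(i >= rm) + int(j >= cm)].append((i, j, v))
    bounds = [(r0, rm, c0, cm), (r0, rm, cm, c1),
              (rm, r1, c0, cm), (rm, r1, cm, c1)]
    return {"children": [build_quadtree(q, *b, leaf_nnz)
                         for q, b in zip(quads, bounds)]}


def spmv(node, x, y):
    """Accumulate y += A x by visiting the leaves of the quadrant tree."""
    if isinstance(node, list):
        for i, j, v in node:
            y[i] += v * x[j]
    else:
        for child in node["children"]:
            spmv(child, x, y)


# A small 3 x 3 example matrix given as (row, col, value) triplets.
entries = [(0, 0, 4.0), (1, 1, 5.0), (1, 2, 2.0), (2, 0, 1.0), (2, 2, 3.0)]
tree = build_quadtree(entries, 0, 3, 0, 3)
y = [0.0, 0.0, 0.0]
spmv(tree, [1.0, 2.0, 3.0], y)
```

Visiting the nonzeros block by block keeps accesses to the input and output vectors confined to a small index range at a time, which is the cache-related motivation the abstract points to.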


“…In recent years, the compressed sparse row (CSR) technology is very popular in finite element analysis. It can significantly reduce the memory requirements through only storing the non‐zeros of stiffness matrix.…”

Section: Multilayer and Multigrain Parallel Computing Approach (mentioning)

confidence: 99%

“…In recent years, the compressed sparse row (CSR) technology is very popular in finite element analysis [31]. It can significantly reduce the memory requirements through only storing the non-zeros of stiffness matrix. If we can use the CSR format to store structure stiffness matrix instead of the Skyline format, then the memory requirements will be considerably reduced.…”

Section: Solution To Limited Storage Of Mic Card (mentioning)

confidence: 99%
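The CSR format mentioned in the statement above stores only the nonzeros of a matrix, together with their column indices and one offset per row. The following minimal Python sketch shows the idea; the function names and the example matrix are illustrative assumptions and are not taken from any of the cited papers.

```python
def dense_to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays.

    Returns (values, col_idx, row_ptr): the nonzero values, their column
    indices, and for each row the offset of its first nonzero in values.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        # The nonzeros of row i occupy values[row_ptr[i]:row_ptr[i + 1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


A = [[4.0, 0.0, 0.0],
     [0.0, 5.0, 2.0],
     [1.0, 0.0, 3.0]]
values, col_idx, row_ptr = dense_to_csr(A)
y = csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

For this 3 x 3 example the CSR arrays hold 5 values, 5 column indices and 4 row offsets instead of 9 dense entries; the saving grows with the fraction of zeros in the matrix.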

An approach to enhance the performance of large‐scale structural analysis on CPU‐MIC heterogeneous clusters

Miao

Jin

Ding

2016

Concurrency and Computation


Summary: Clusters with the CPU‐MIC heterogeneous architecture are becoming more popular in recent years. However, it is not easy to achieve good performance on such machines. The key challenge has been the asymmetry within clusters, arising from different kinds of execution units as well as different communication latencies. To improve the performance of large‐scale structural analysis on CPU‐MIC heterogeneous clusters, a multi‐layer and multi‐grain collaborative parallel computing approach is proposed in the paper. The proposed method combines the parallel algorithm and the hardware architecture of CPU‐MIC heterogeneous clusters together. Through mapping computing tasks to various hardware layers, it both resolves the load balance problem between CPU and MIC devices and significantly reduces the communication overheads of the system. Numerical experiments conducted on Tianhe‐2 supercomputer show that the proposed method obtained better performance compared with the traditional approach. Scalability investigation showed that the proposed method had good scalability with respect to problem sizes. The findings of this paper are of help to the parallel porting and performance optimization of other applications on CPU‐MIC heterogeneous clusters.
