Communication optimal parallel multiplication of sparse random matrices

Ballard, Grey; Buluç, Aydın; Demmel, James; Grigori, Laura; Lipshitz, Benjamin; Schwartz, Oded; Toledo, Sivan

doi:10.1145/2486159.2486196

Cited by 57 publications

(42 citation statements)

References 32 publications

“…Hence, the FLOP-to-communication ratio decreases as 1/√P in these weak scaling experiments. This observation can also be shown theoretically [22]. The bottleneck is more pronounced on networks where the effective bandwidth decreases with increasing P. Similar observations hold for indexing overhead, which also becomes more important as the number of multiplication steps increases.…”

Section: Performance Limits

supporting

confidence: 71%

Sparse matrix multiplication: The distributed block-compressed sparse row library

et al. 2014

“…T AB: Optimizing sparse matrix-matrix multiplication is an active area of research [17], [18]; state-of-the-art implementations are bound by the memory bandwidth and heavily underutilize the compute resources.…”

Section: ) Optimizing Res = A

mentioning

confidence: 99%

A Multi-Platform Evaluation of the Randomized CX Low-Rank Matrix Factorization in Spark

Gittens, Kottalam, Yang et al. 2016

2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Abstract: We investigate the performance and scalability of the randomized CX low-rank matrix factorization and demonstrate its applicability through the analysis of a 1TB mass spectrometry imaging (MSI) dataset, using Apache Spark on an Amazon EC2 cluster, a Cray XC40 system, and an experimental Cray cluster. We implemented this factorization both as a parallelized C implementation with hand-tuned optimizations and in Scala using the Apache Spark high-level cluster computing framework. We obtained consistent performance across the three platforms: using Spark we were able to process the 1TB size dataset in under 30 minutes with 960 cores on all systems, with the fastest times obtained on the experimental Cray cluster. In comparison, the C implementation was 21X faster on the Amazon EC2 system, due to careful cache optimizations, bandwidth-friendly access of matrices and vector computation using SIMD units. We report these results and their implications on the hardware and software issues arising in supporting data-centric workloads in parallel and distributed environments.

“…Parallelisation and indexing techniques for sparse matrices multiplication were implemented by Buluc and Gilbert [7]. The communication overhead problem of sparse matrices multiplication was solved by Ballard et al. [8]. The parallelisation technique for sparse tensor matrix multiplication was proposed by Smith et al. [9].…”

Section: Introduction

mentioning

confidence: 99%

“…The parallelisation technique for sparse tensor matrix multiplication was proposed by Smith et al. [9]. The above approaches [7][8][9] are not suitable for Big Data applications. Proper care should be taken by the programmer regarding the data distribution, replication, load balancing, communication overhead etc.…”

Section: Introduction

mentioning

confidence: 99%