2016 IEEE International Conference on Big Data (Big Data)
DOI: 10.1109/bigdata.2016.7840606
Matrix factorizations at scale: A comparison of scientific data analytics in Spark and C+MPI using three case studies

Abstract: We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity), and CX (for data interpretability). We apply these methods to 1.6TB particle physics, 2.2TB and 16TB cli…
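For a concrete picture of the Spark side of the PCA case study, a rank-k PCA over a distributed row matrix is typically expressed through MLlib's RowMatrix API. The following is a minimal sketch on toy in-memory data; the object name and the tiny dataset are illustrative stand-ins for the paper's TB-scale inputs, not the authors' benchmark code:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

object PcaSketch {
  def main(args: Array[String]): Unit = {
    // local[*] is only for a single-machine demo; the paper ran on HPC clusters.
    val spark = SparkSession.builder.appName("pca-sketch").master("local[*]").getOrCreate()

    // Toy stand-in for the TB-scale matrices studied in the paper.
    val rows = spark.sparkContext.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0)
    ))
    val mat = new RowMatrix(rows)

    // Top-2 principal components: MLlib forms the Gramian in a distributed
    // pass and solves the small eigenproblem on the driver.
    val pc = mat.computePrincipalComponents(2)

    // Project the rows onto the principal subspace.
    val projected = mat.multiply(pc)
    projected.rows.collect().foreach(println)

    spark.stop()
  }
}

Run via spark-submit, this prints the rows projected onto the top two principal components; at the paper's scale, the rows RDD would instead be loaded from distributed storage.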

Cited by 42 publications (52 citation statements)
References 41 publications (40 reference statements)
“…The results showed that MPI/OpenMP outperforms Spark by more than one order of magnitude in terms of processing speed, while Spark has advantages in some other aspects, such as data management infrastructure and fault tolerance. [28] implemented 3 matrix kernels on Spark, and the comparisons with C+MPI implementations showed a performance gap of 10x-40x without I/O. [29] proposed a system for integrating MPI with Spark and achieved 3.1-17.7x speedups on four graph and machine learning applications.…”
Section: Discussion · Mentioning (confidence: 99%)
“…It is evident that the overheads take up approximately 20% of the total runtime. Comparing this to the overheads encountered by Spark, which are orders of magnitude larger than the actual compute times (see the discussion in the work of Gittens et al.), we see that there is a significant improvement in performance when using Alchemist to perform the SVD computations. This difference is further highlighted in Figure , which shows an enormous performance difference, including that Spark was unable to complete the SVD computations in the allotted time for all but the smallest matrix.…”
Section: Methods · Mentioning (confidence: 99%)
“…While non-negligible, the overheads constitute just 20% of the overall running time of the truncated SVD procedure. This is a significant improvement on the overheads incurred by Spark (see the discussion in the work of Gittens et al.).…”
Section: Methods · Mentioning (confidence: 99%)
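The Spark-side "SVD computations" discussed in these excerpts are commonly written against the same MLlib RowMatrix API. A minimal sketch of a rank-k truncated SVD follows; the toy data and object name are illustrative assumptions, not code from the Alchemist work or the benchmarks:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

object SvdSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("svd-sketch").master("local[*]").getOrCreate()

    val rows = spark.sparkContext.parallelize(Seq(
      Vectors.dense(2.0, 0.0, 1.0),
      Vectors.dense(0.0, 3.0, 4.0),
      Vectors.dense(1.0, 1.0, 0.0)
    ))
    val mat = new RowMatrix(rows)

    // Rank-2 truncated SVD: the singular values s and right factor V are
    // returned to the driver, while U stays distributed as a RowMatrix.
    val svd = mat.computeSVD(2, computeU = true)

    println(s"Singular values: ${svd.s}")
    println(s"V:\n${svd.V}")
    svd.U.rows.collect().foreach(println)

    spark.stop()
  }
}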
“…Some of the applications investigated in these case studies include distributed graph analytics [21], and k-nearest neighbors and support vector machines [16]. However, it is our recent empirical evaluations [4] that serve as the main motivation for the development of Alchemist. Our results in [4] illustrate the difference in computing times when performing certain matrix factorizations in Apache Spark, compared to using MPI-based routines written in C or C++ (C+MPI).…”
Section: Introduction · Mentioning (confidence: 99%)