2016 IEEE International Conference on Big Data (Big Data)
DOI: 10.1109/bigdata.2016.7840606
Matrix factorizations at scale: A comparison of scientific data analytics in Spark and C+MPI using three case studies

Abstract: We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity), and CX (for data interpretability). We apply these methods to 1.6TB particle physics, 2.2TB and 16TB cli…
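For a concrete picture of the Spark side of the PCA case study, a rank-k PCA over a distributed row matrix is typically expressed through MLlib's RowMatrix API. The following is a minimal sketch on toy in-memory data; the object name and the tiny dataset are illustrative stand-ins for the paper's TB-scale inputs, not the authors' benchmark code:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

object PcaSketch {
  def main(args: Array[String]): Unit = {
    // local[*] is only for a single-machine demo; the paper ran on HPC clusters.
    val spark = SparkSession.builder.appName("pca-sketch").master("local[*]").getOrCreate()

    // Toy stand-in for the TB-scale matrices studied in the paper.
    val rows = spark.sparkContext.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0)
    ))
    val mat = new RowMatrix(rows)

    // Top-2 principal components: MLlib forms the Gramian in a distributed
    // pass and solves the small eigenproblem on the driver.
    val pc = mat.computePrincipalComponents(2)

    // Project the rows onto the principal subspace.
    val projected = mat.multiply(pc)
    projected.rows.collect().foreach(println)

    spark.stop()
  }
}

Run via spark-submit, this prints the rows projected onto the top two principal components; at the paper's scale, the rows RDD would instead be loaded from distributed storage.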

Cited by 42 publications (52 citation statements)
References 41 publications (40 reference statements)
“…The results showed that MPI/OpenMP outperforms Spark by more than one order of magnitude in terms of processing speed, while Spark has advantages in some other aspects, such as data management infrastructure and fault tolerance. [28] implemented 3 matrix kernels on Spark, and the comparisons with C+MPI implementations showed a performance gap of 10x-40x without I/O. [29] proposed a system for integrating MPI with Spark and achieved 3.1-17.7x speedups on four graph and machine learning applications.…”
Section: Discussion · Mentioning (confidence: 99%)
“…It is evident that the overheads take up approximately 20% of the total runtime. Comparing this to the overheads encountered by Spark, which are orders of magnitude larger than the actual compute times (see the discussion in the work of Gittens et al.), we see that there is a significant improvement in performance when using Alchemist to perform the SVD computations. This difference is further highlighted in Figure , which shows an enormous performance difference, including that Spark was unable to complete the SVD computations in the allotted time for all but the smallest matrix.…”
Section: Methods · Mentioning (confidence: 99%)
“…While non-negligible, the overheads constitute just 20% of the overall running time of the truncated SVD procedure. This is a significant improvement on the overheads incurred by Spark (see the discussion in the work of Gittens et al.).…”
Section: Methods · Mentioning (confidence: 99%)
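The Spark-side "SVD computations" discussed in these excerpts are commonly written against the same MLlib RowMatrix API. A minimal sketch of a rank-k truncated SVD follows; the toy data and object name are illustrative assumptions, not code from the Alchemist work or the benchmarks:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

object SvdSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("svd-sketch").master("local[*]").getOrCreate()

    val rows = spark.sparkContext.parallelize(Seq(
      Vectors.dense(2.0, 0.0, 1.0),
      Vectors.dense(0.0, 3.0, 4.0),
      Vectors.dense(1.0, 1.0, 0.0)
    ))
    val mat = new RowMatrix(rows)

    // Rank-2 truncated SVD: the singular values s and right factor V are
    // returned to the driver, while U stays distributed as a RowMatrix.
    val svd = mat.computeSVD(2, computeU = true)

    println(s"Singular values: ${svd.s}")
    println(s"V:\n${svd.V}")
    svd.U.rows.collect().foreach(println)

    spark.stop()
  }
}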
“…Some of the applications investigated in these case studies include distributed graph analytics [21], and k-nearest neighbors and support vector machines [16]. However, it is our recent empirical evaluations [4] that serve as the main motivation for the development of Alchemist. Our results in [4] illustrate the difference in computing times when performing certain matrix factorizations in Apache Spark, compared to using MPI-based routines written in C or C++ (C+MPI).…”
Section: Introduction · Mentioning (confidence: 99%)