2016
DOI: 10.1007/978-3-319-29006-5_7

How Data Volume Affects Spark Based Data Analytics on a Scale-up Server

Abstract: The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on commodity machines, the impact of data volume on the performance of Spark-based data analytics in a scale-up configuration is not well understood. We present a deep-dive analysis of Spark-based applications on a large scale-up server machine. Our …

Cited by 12 publications (17 citation statements)
References 18 publications
“…Table 1 provides a categorization of Spark parameters. In this work, we target parameters belonging to the Shuffle Behavior and Compression and Serialization aspects, which greatly contribute to a Spark application's running time, as supported by our experimental results, the official documentation, and the evidence provided in other works, such as [4,5,3]. Note that there are several other parameters belonging to categories such as Application Properties, Execution Behavior and Networking that may affect the performance, but these parameters are typically set at the cluster level, i.e., they are common to all applications running on the same cluster of machines, e.g., as shown in [6].…”
Section: Spark Basics (mentioning)
confidence: 83%
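For concreteness, here is a minimal Scala sketch of how parameters in the Shuffle Behavior and Compression and Serialization categories discussed above can be set through SparkConf. The property names come from Spark's official configuration documentation; the values are illustrative placeholders, since the cited works show that good settings are workload-dependent.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative values only; optimal settings depend on the
// application and the data volume.
val conf = new SparkConf()
  .setAppName("tuned-analytics")
  // Shuffle Behavior
  .set("spark.shuffle.compress", "true")        // compress map output files
  .set("spark.shuffle.file.buffer", "64k")      // buffer size per shuffle file writer
  .set("spark.reducer.maxSizeInFlight", "96m")  // map output fetched concurrently per reducer
  // Compression and Serialization
  .set("spark.io.compression.codec", "lz4")     // codec for shuffle and broadcast data
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val spark = SparkSession.builder().config(conf).getOrCreate()
```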
“…Therefore, tuning arbitrary Spark applications by inexpensively navigating through the vast search space of all possible configurations in a principled manner is a challenging task. Very few research endeavors focus on issues related to understanding the performance of Spark applications and the role of tunable parameters [4,5,6]. For the latter, Spark's official configuration 1 and tuning 2 guides and the tutorial book [7] provide a valuable asset in understanding the role of every single parameter.…”
Section: Introduction (mentioning)
confidence: 99%
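To illustrate why that search space is vast, the sketch below enumerates the cross product of a few candidate values per parameter. `runWorkload` is a hypothetical placeholder for executing the application under one configuration and measuring its running time; it is not an API from any of the cited works.

```scala
// Each added parameter multiplies the number of candidate configurations:
// here 2 * 3 * 3 = 18 runs are needed for exhaustive search.
val grid = Map(
  "spark.shuffle.compress"        -> Seq("true", "false"),
  "spark.io.compression.codec"    -> Seq("lz4", "snappy", "zstd"),
  "spark.reducer.maxSizeInFlight" -> Seq("48m", "96m", "192m")
)

// Build the cross product of all value lists.
val candidates = grid.foldLeft(Seq(Map.empty[String, String])) {
  case (partials, (key, values)) =>
    for (partial <- partials; v <- values) yield partial + (key -> v)
}

// Hypothetical evaluation step: time the application under each configuration.
// val best = candidates.minBy(runWorkload)
```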
“…According to its authors, the system has proven to be highly scalable and fault tolerant. However, in most Java-based Map-Reduce platforms [36] the deep component stack and its dependence on the JVM entail significant memory consumption, which also affects execution time because of frequent garbage collection operations [37], [38], and serialization overhead if bindings to other languages are used [39]. A performance comparison between the Hadoop and Spark frameworks in terms of CPU, memory and I/O usage is presented in [40].…”
Section: B. Data-centric Batch and Stream Processing (mentioning)
confidence: 99%
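As a hedged sketch of mitigations commonly applied to the JVM overheads described above, the following settings switch to Kryo serialization and a low-pause garbage collector. Both choices are illustrative assumptions, not prescriptions from the cited studies.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Kryo is typically faster and more compact than default Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // G1 tends to shorten GC pauses on large heaps; verbose GC output helps
  // attribute execution-time variance to collection activity
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -verbose:gc")
```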
“…Many packages 9 have also been contributed to Apache Spark from both academia and industry. Furthermore, the creators of Apache Spark founded Databricks 10 , a company which is closely involved in the development of Apache Spark.…”
Section: Overview of Apache Spark (mentioning)
confidence: 99%
“…Other works compare Apache Spark with other frameworks such as MapReduce [72], study the performance of Apache Spark for specific scenarios such as a scale-up configuration [10], analyze the performance of Spark's programming model for large-scale data analytics [78], and identify the performance bottlenecks in Apache Spark [66], [11]. In addition, as Apache Spark offers language-integrated APIs, there are some efforts to provide the APIs in other languages.…”
Section: Related Research (mentioning)
confidence: 99%