2016
DOI: 10.1007/978-3-319-29006-5_7

How Data Volume Affects Spark Based Data Analytics on a Scale-up Server

Abstract: The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on commodity machines, the impact of data volume on the performance of Spark-based data analytics in a scale-up configuration is not well understood. We present a deep-dive analysis of Spark-based applications on a large scale-up server machine. Our …

Cited by 12 publications (17 citation statements)
References 18 publications
“…Table 1 provides a categorization of Spark parameters. In this work, we target parameters belonging to the Shuffle Behavior and Compression and Serialization aspects, which greatly contribute to a Spark application's running time, as supported by our experimental results, the official documentation, and the evidence provided in other works, such as [4,5,3]. Note that there are several other parameters belonging to categories such as Application Properties, Execution Behavior and Networking that may affect the performance, but these parameters are typically set at the cluster level, i.e., they are common to all applications running on the same cluster of machines, e.g., as shown in [6].…”
Section: Spark Basics (mentioning)
confidence: 83%
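For concreteness, here is a minimal Scala sketch of how parameters in the Shuffle Behavior and Compression and Serialization categories discussed above can be set through SparkConf. The property names come from Spark's official configuration documentation; the values are illustrative placeholders, since the cited works show that good settings are workload-dependent.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative values only; optimal settings depend on the
// application and the data volume.
val conf = new SparkConf()
  .setAppName("tuned-analytics")
  // Shuffle Behavior
  .set("spark.shuffle.compress", "true")        // compress map output files
  .set("spark.shuffle.file.buffer", "64k")      // buffer size per shuffle file writer
  .set("spark.reducer.maxSizeInFlight", "96m")  // map output fetched concurrently per reducer
  // Compression and Serialization
  .set("spark.io.compression.codec", "lz4")     // codec for shuffle and broadcast data
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val spark = SparkSession.builder().config(conf).getOrCreate()
```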
“…Therefore, tuning arbitrary Spark applications by inexpensively navigating through the vast search space of all possible configurations in a principled manner is a challenging task. Very few research endeavors focus on issues related to understanding the performance of Spark applications and the role of tunable parameters [4,5,6]. For the latter, Spark's official configuration 1 and tuning 2 guides and the tutorial book [7] provide a valuable asset in understanding the role of every single parameter.…”
Section: Introduction (mentioning)
confidence: 99%
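To illustrate why that search space is vast, the sketch below enumerates the cross product of a few candidate values per parameter. `runWorkload` is a hypothetical placeholder for executing the application under one configuration and measuring its running time; it is not an API from any of the cited works.

```scala
// Each added parameter multiplies the number of candidate configurations:
// here 2 * 3 * 3 = 18 runs are needed for exhaustive search.
val grid = Map(
  "spark.shuffle.compress"        -> Seq("true", "false"),
  "spark.io.compression.codec"    -> Seq("lz4", "snappy", "zstd"),
  "spark.reducer.maxSizeInFlight" -> Seq("48m", "96m", "192m")
)

// Build the cross product of all value lists.
val candidates = grid.foldLeft(Seq(Map.empty[String, String])) {
  case (partials, (key, values)) =>
    for (partial <- partials; v <- values) yield partial + (key -> v)
}

// Hypothetical evaluation step: time the application under each configuration.
// val best = candidates.minBy(runWorkload)
```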
“…According to its authors, the system has proven to be highly scalable and fault tolerant. However, in most Java-based Map-Reduce platforms [36] the deep component stack and its dependence on the JVM entail significant memory consumption, which also affects execution time because of frequent garbage collection operations [37], [38], and serialization overhead if bindings to other languages are used [39]. A performance comparison between the Hadoop and Spark frameworks in terms of CPU, memory and I/O usage is presented in [40].…”
Section: B. Data-centric Batch and Stream Processing (mentioning)
confidence: 99%
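As a hedged sketch of mitigations commonly applied to the JVM overheads described above, the following settings switch to Kryo serialization and a low-pause garbage collector. Both choices are illustrative assumptions, not prescriptions from the cited studies.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Kryo is typically faster and more compact than default Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // G1 tends to shorten GC pauses on large heaps; verbose GC output helps
  // attribute execution-time variance to collection activity
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -verbose:gc")
```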
“…Many packages 9 have also been contributed to Apache Spark from both academia and industry. Furthermore, the creators of Apache Spark founded Databricks 10 , a company which is closely involved in the development of Apache Spark.…”
Section: Overview of Apache Spark (mentioning)
confidence: 99%
“…Other works compare Apache Spark with other frameworks such as MapReduce [72], study the performance of Apache Spark for specific scenarios such as a scale-up configuration [10], analyze the performance of Spark's programming model for large-scale data analytics [78], and identify the performance bottlenecks in Apache Spark [66], [11]. In addition, as Apache Spark offers language-integrated APIs, there are some efforts to provide the APIs in other languages.…”
Section: Related Research (mentioning)
confidence: 99%