2020
DOI: 10.1186/s40537-020-00388-5
A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench

Abstract: Big Data analytics for storing, processing, and analyzing large-scale datasets has become an essential tool for the industry. The advent of distributed computing frameworks such as Hadoop and Spark offers efficient solutions to analyze vast amounts of data. Due to its application programming interface (API) availability and its performance, Spark has become very popular, even more popular than the MapReduce framework. Both these frameworks have more than 150 parameters, and the combination of these parameters has…



Cited by 71 publications (25 citation statements)
References 27 publications
“…The processing speed of Hadoop MapReduce is slow since it needs disk access for reads & writes. Spark, on the other hand, stores data in memory, decreasing the read or write cycles (Ahmed et al, 2020). In memory, Spark can run applications up to hundreds of times faster than Hadoop MapReduce, while on disk, it can run applications ten times faster (Al-Barznji and Atanassov, 2018).…”
Section: Apache Sparkmentioning
confidence: 99%
“…They performed variance analysis on different components of the MapReduce workflow to identify the possible sources of modeling error. Ahmed et al [19] conducted a comparative study of Hadoop and Spark performance using HiBench workloads, with different combinations of parameter settings pertaining to resource utilization, input splits, and shuffle groups. They used a subset of nine Hadoop parameters and eight Spark parameters for the experiment.…”
Section: Related Workmentioning
confidence: 99%
“…Ahmed et al [10] aim to identify the parameters with the highest impact on the performance of Hadoop and Spark by using a trial-and-error approach to tune them under a variety of experimental settings. Their evaluation metrics for measuring the benchmarked frameworks' performance are execution time, throughput, and speedup.…”
Section: Related Workmentioning
confidence: 99%
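The citing works above refer to tuning a subset of the 150+ Hadoop/Spark configuration parameters (resource utilization, input splits, shuffle groups). As a minimal sketch of what such knobs look like on the Spark side — the specific values here are illustrative assumptions, not the settings used in the cited experiments — a handful of well-known properties can be set in `spark-defaults.conf`:

```properties
# spark-defaults.conf — illustrative tuning sketch only; values are assumptions,
# not the configuration from the benchmarked study.
spark.executor.memory         4g    # heap per executor (resource utilization)
spark.executor.cores          4     # concurrent tasks per executor
spark.default.parallelism     64    # default partition count for RDD shuffles
spark.sql.shuffle.partitions  64    # number of shuffle groups for SQL/DataFrame stages
```

Trial-and-error tuning, as described in the citation above, amounts to sweeping combinations of such properties and measuring execution time, throughput, and speedup for each workload.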