A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench

Ahmed, N.; Barczak, Andre L. C.; Sušnjak, Teo; Rashid, M. A.

doi:10.1186/s40537-020-00388-5

Cited by 71 publications

(25 citation statements)

References 27 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The processing speed of Hadoop MapReduce is slow since it needs disk access for reads & writes. Spark, on the other hand, stores data in memory, decreasing the read or write cycles (Ahmed et al, 2020). In memory, Spark can run applications up to hundreds of times faster than Hadoop MapReduce, while on disk, it can run applications ten times faster (Al-Barznji and Atanassov, 2018).…”

Section: Apache Sparkmentioning

confidence: 99%

Big Data Processing Frameworks for Handling Huge Data Efficiencies and Challenges: A Survey

Al-Barznji¹

2022

IJDSBDA

View full text Add to dashboard Cite

The increasing expansion of digital data collected from many sources renders traditional storage, processing, and analysis methods obsolete. For these restrictions, new technologies for processing and storing very massive datasets have been developed. Big data processing is required to extract relevant information from it. Transforming data into information and knowledge is what processing implies. Big data processing is the process of dealing with massive amounts of data and changing it from its raw form into useable information in a more understandable manner. As a result, numerous big data processing execution frameworks have emerged, but determining and selecting the appropriate framework for processing your big data applications is a significant challenge. Therefore, this paper investigates the possible influence of big data challenges and discusses in depth the most well-known approaches to big data processing, which are divided into five classes: batch processing, streaming processing, realtime processing, interactive processing, and hybrid processing, as well as the variety of the most popular frameworks associated with them such as Apache Hadood, Dryad, Samza, IBM Infosphere, Storm, Amazon Kinesis, Drill, Impala, Flink, and Spark. Furthermore, this study presents a comparison among the several features of the frameworks by highlighting their drawbacks and strengths. Thus, it can be used as a guideline for picking the best application framework in IT analytics and will help busi-ness users make faster decisions.

show abstract

Section: Apache Sparkmentioning

confidence: 99%

Big Data Processing Frameworks for Handling Huge Data Efficiencies and Challenges: A Survey

Al-Barznji¹

2022

IJDSBDA

View full text Add to dashboard Cite

show abstract

“…They performed variance analysis on different components of the MapReduce workflow to identify the possible sources of modeling error. Ahmed et al [19] conducted a comparative study of Hadoop and Spark performances using Hibenchmark workloads using different combinations of parametric settings pertaining to resource utilization, input splits and shuffle groups. They used a subset of nine Hadoop parameters and eight Spark parameters for the experiment.…”

Section: Related Workmentioning

confidence: 99%

An Analytic solution for the Hadoop Configuration Combinatorial Puzzle based on General Factorial Design

2022

KSII TIIS

View full text Add to dashboard Cite

Big data analytics offers endless opportunities for operational enhancement by extracting valuable insights from complex voluminous data. Hadoop is a comprehensive technological suite which offers solutions for the large scale storage and computing needs of Big data. The performance of Hadoop is closely tied with its configuration settings which depends on the cluster capacity and the application profile. Since Hadoop has over 190 configuration parameters, tuning them to gain optimal application performance is a daunting challenge. Our approach is to extract a subset of impactful parameters from which the performance enhancing sub-optimal configuration is then narrowed down. This paper presents a statistical model to analyze the significance of the effect of Hadoop parameters on a variety of performance metrics. Our model decomposes the total observed performance variation and ascribes them to the main parameters, their interaction effects and noise factors. The method clearly segregates impactful parameters from the rest. The configuration setting determined by our methodology has reduced the Job completion time by 22%, resource utilization in terms of memory and CPU by 15% and 12% respectively, the number of killed Maps by 50% and Disk spillage by 23%. The proposed technique can be leveraged to ease the configuration tuning task of any Hadoop cluster despite the differences in the underlying infrastructure and the application running on it.

show abstract

“…Ahmed et al [10] aim to identify the parameters with the highest impact on the performance of Hadoop and Spark by using trial-and-error approach to tune them under a variety of experimental settings. Their evaluation metrics to measure benchmarked frameworks' performance are execution time, throughput, and speedup.…”

Section: Related Workmentioning

confidence: 99%

Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

Tekdogan

Çakmak

2021

2021 5th International Conference on Cloud and Big Data Computing (ICCBDC)

View full text Add to dashboard Cite

Most of the popular Big Data analytics tools evolved to adapt their working environment to extract valuable information from a vast amount of unstructured data. The ability of data mining techniques to filter this helpful information from Big Data led to the term 'Big Data Mining'. Shifting the scope of data from small-size, structured, and stable data to huge volume, unstructured, and quickly changing data brings many data management challenges. Different tools cope with these challenges in their own way due to their architectural limitations. There are numerous parameters to take into consideration when choosing the right data management framework based on the task at hand. In this paper, we present a comprehensive benchmark for two widely used Big Data analytics tools, namely Apache Spark and Hadoop MapReduce, on a common data mining task, i.e., classification. We employ several evaluation metrics to compare the performance of the benchmarked frameworks, such as execution time, accuracy, and scalability. These metrics are specialized to measure the performance for classification task. To the best of our knowledge, there is no previous study in the literature that employs all these metrics while taking into consideration task-specific concerns. We show that Spark is 5 times faster than MapReduce on training the model. Nevertheless, the performance of Spark degrades when the input workload gets larger. Scaling the environment by additional clusters significantly improves the performance of Spark. However, similar enhancement is not observed in Hadoop. Machine learning utility of MapReduce tend to have better accuracy scores than that of Spark, like around 2%-3%, even in small-size data sets.

show abstract

A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench

Cited by 71 publications

References 27 publications

Big Data Processing Frameworks for Handling Huge Data Efficiencies and Challenges: A Survey

Big Data Processing Frameworks for Handling Huge Data Efficiencies and Challenges: A Survey

An Analytic solution for the Hadoop Configuration Combinatorial Puzzle based on General Factorial Design

Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

Contact Info

Product

Resources

About