Performance evaluation of distributed computing environments with Hadoop and Spark frameworks

Taran, Vlad; Alienin, Oleg; Stirenko, Sergii; Gordienko, Yuri; Rojbi, Anis

doi:10.1109/ysf.2017.8126655

Cited by 18 publications

(5 citation statements)

References 16 publications

“…It also introduces a directed acyclic graph (DAG) task segmentation mechanism to operate on RDD in a way similar to MapReduce. Spark in-memory computing is much faster than Hadoop, which makes Spark the current mainstream batch big data analysis platform [55][56][57]. The pros and cons of Spark in big data analysis will be discussed in the next section.…”

Section: Spark (mentioning)

confidence: 99%

Survey of Distributed Computing Frameworks for Supporting Big Data Analysis

Sun

et al. 2023

Big Data Min. Anal.

Distributed computing frameworks are the fundamental component of distributed computing systems. They provide an essential way to support the efficient processing of big data on clusters or cloud. The size of big data increases at a pace that is faster than the increase in the big data processing capacity of clusters. Thus, distributed computing frameworks based on the MapReduce computing model are not adequate to support big data analysis tasks which often require running complex analytical algorithms on extremely big data sets in terabytes. In performing such tasks, these frameworks face three challenges: computational inefficiency due to high I/O and communication costs, non-scalability to big data due to memory limit, and limited analytical algorithms because many serial algorithms cannot be implemented in the MapReduce programming model. New distributed computing frameworks need to be developed to conquer these challenges. In this paper, we review MapReduce-type distributed computing frameworks that are currently used in handling big data and discuss their problems when conducting big data analysis. In addition, we present a non-MapReduce distributed computing framework that has the potential to overcome big data analysis challenges.

“…Likewise, the in-house Hadoop cluster setup and Amazon EC2 instances are also used to evaluate the Hadoop performance. Khan et al [30] have modeled the estimation of the provisioning of the resources and completion time of the jobs. Furthermore, the Hadoop and Spark-based distributed system performance has been evaluated by Taran et al [31].…”

Section: Related Work (mentioning)

confidence: 99%

A Comparative Analysis of Hadoop and Spark Frameworks using Word Count Algorithm

Benlachimi

El

Lahcen

2021

IJACSA

With the advent of the Big Data explosion due to the Information Technology (IT) revolution during the last few decades, processing and analyzing the data at low cost in minimum time has become immensely challenging. The field of Big Data analytics is driven by the demand to process Machine Learning (ML) data, real-time streaming data, and graphics processing. The most efficient solutions to Big Data analysis in a distributed environment are Hadoop and Spark, both administered by Apache; both are open-source data management frameworks that distribute and process large datasets across multiple clusters of computing nodes. This paper provides a comprehensive comparison between Apache Hadoop and Apache Spark in terms of efficiency, scalability, security, cost-effectiveness, and other parameters. It describes the primary components of the Hadoop and Spark frameworks to compare their performance. The major conclusion is that Spark is better in terms of scalability and speed for real-time streaming applications, whereas Hadoop is more viable for applications dealing with bigger datasets. This case study evaluates the performance of various components of Hadoop, such as MapReduce and the Hadoop Distributed File System (HDFS), by applying them to the well-known Word Count algorithm to ascertain their efficacy in terms of storage and computational time. Subsequently, it also provides an analysis of how Spark's in-memory processing could reduce the computational time of the Word Count algorithm.
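
For readers unfamiliar with the Word Count workload named in this abstract, the following is a minimal single-process Python sketch of its map, shuffle, and reduce phases. It is an illustration only: it is not the cited authors' code, it uses neither Hadoop nor Spark, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted counts by word, as a framework would between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped counts to get one total per word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases over an in-memory collection of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["spark and hadoop", "hadoop and hdfs"]
    print(word_count(sample))
```

In a real cluster the three phases run on different nodes and the shuffle moves data over the network; this sketch keeps everything in one process purely to show the data flow.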

“…The primary reason for the performance decline was evident as Spark cache size could not fit into the memory for the larger dataset. Taran et al [34] quantified performance differences of Hadoop and Spark using WordCount dataset which was ranging from 100 KB to 1 GB. It was observed that Hadoop framework was five times faster than Spark when the evaluation was performed using a larger set of data sources.…”

Section: Processing Speed (mentioning)

confidence: 99%

Big Data in Cloud Computing: A Resource Management Perspective

Ullah

Awan

Khiyal

2018

Scientific Programming

The modern day advancement is increasingly digitizing our lives which has led to a rapid growth of data. Such multidimensional datasets are precious due to the potential of unearthing new knowledge and developing decision-making insights from them. Analyzing this huge amount of data from multiple sources can help organizations to plan for the future and anticipate changing market trends and customer requirements. While the Hadoop framework is a popular platform for processing larger datasets, there are a number of other computing infrastructures, available to use in various application domains. The primary focus of the study is how to classify major big data resource management systems in the context of cloud computing environment. We identify some key features which characterize big data frameworks as well as their associated challenges and issues. We use various evaluation metrics from different aspects to identify usage scenarios of these platforms. The study came up with some interesting findings which are in contradiction with the available literature on the Internet.
