2022
DOI: 10.48550/arxiv.2202.13293
Preprint
Past, Present and Future of Hadoop: A Survey

Cited by 3 publications (4 citation statements)
References 0 publications
“…With the help of this tool, we can construct and execute MapReduce tasks using any script as the mapper and reducer. The mapper reads the data from stdin via the Hadoop Streaming tool and provides the mapped key-value pairs to the reducer; the results of the reducing operation are written to stdout and then stored in HDFS [10]. The mapper phase is responsible for calculating the Euclidean distance between each training-set point and the target point, and the output of the mapper is a set of <Distance, Class> pairs that serve as the reducer's input. In the reducer, the minimum k distances are determined, and the class with the maximum frequency represents the predicted class.…”
Section: Hadoop Streaming
confidence: 99%
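The quoted kNN workflow can be sketched as plain Python functions. The query point, the value of k, the record format (`x,y,label`), and the function names are hypothetical stand-ins; in a real Hadoop Streaming job the mapper and reducer would be separate scripts reading stdin and writing stdout, with the query point shipped to each task (e.g. via an environment variable or the distributed cache):

```python
import math
from collections import Counter

# Hypothetical query point and k; a real Streaming job would distribute
# these to every mapper task rather than hard-coding them.
QUERY = (2.0, 3.0)
K = 3

def map_line(line):
    """Mapper step: one training record 'f1,f2,...,label' -> (distance, label)."""
    *feats, label = line.strip().split(",")
    point = tuple(float(f) for f in feats)
    dist = math.sqrt(sum((p - q) ** 2 for p, q in zip(point, QUERY)))
    return dist, label

def reduce_pairs(pairs, k=K):
    """Reducer step: keep the k nearest (distance, label) pairs, vote on the class."""
    nearest = sorted(pairs)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    # Toy training set standing in for records read from HDFS.
    train = ["1.0,2.0,A", "2.0,2.5,A", "8.0,9.0,B", "7.5,8.0,B", "2.5,3.5,A"]
    pairs = [map_line(rec) for rec in train]
    print(reduce_pairs(pairs))  # majority class among the k nearest neighbours
```

Sorting all pairs in the reducer mirrors the description above; at scale one would keep a bounded heap of size k instead.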
“…It has two main subprojects: the Hadoop Distributed File System (HDFS) and the MapReduce programming paradigm [23]. The other subprojects, such as YARN, Common, HBase, Hive, Ozone, and Zookeeper, provide complementary services [24]. Hadoop is suited to high-throughput, in-depth analysis where a larger portion or all of the data is harnessed [25].…”
Section: Hadoop Framework
confidence: 99%
“…All the divisions are processed simultaneously [26] and are parsed into (key, value) pair records. The map function processes these records and maps each of them to a set of intermediate (key, value) pairs [24]. Finally, reducers combine them to get a consolidated output.…”
Section: Hadoop Framework
confidence: 99%
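The split-map-reduce flow described above can be illustrated with a minimal word-count sketch. The splits, function names, and in-memory shuffle are illustrative assumptions, not the survey's code; a real Hadoop job would run the map tasks on separate nodes and group keys during the shuffle phase:

```python
from collections import defaultdict

def map_split(split):
    """Map phase: parse one input split into intermediate (key, value) pairs."""
    return [(word, 1) for word in split.split()]

def reduce_all(intermediate):
    """Reduce phase: combine all values sharing a key into one consolidated output."""
    grouped = defaultdict(int)
    for key, value in intermediate:
        grouped[key] += value
    return dict(grouped)

if __name__ == "__main__":
    # Two splits standing in for divisions processed simultaneously on different nodes.
    splits = ["hadoop stores data", "hadoop processes data"]
    intermediate = [pair for s in splits for pair in map_split(s)]
    print(reduce_all(intermediate))  # consolidated word counts across both splits
```

Each split is mapped independently, so the map calls could run in parallel; only the final grouping requires seeing all intermediate pairs.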
“…Compared to the conventional methodology, problem solving with metaheuristic approaches performed better as the dimensionality of the searched space grew. The authors in [13,14,42,[63][64][65] focused on the MapReduce framework, its limitations, the problems of job scheduling between nodes, and other algorithms proposed by different researchers. Some of these studies then categorized those algorithms according to a variety of performance-related quality indicators.…”
Section: Other Improved Algorithms
confidence: 99%