Nowadays, users must store, scrutinise, and process massive datasets from various fields, including science, business, and research. As a result, they require data-intensive platforms with ample storage and processing power. In addition, many such platforms must provide features like parallel processing, fault tolerance, data dissemination, scalability, availability, and load balancing. Google developed the MapReduce programming paradigm to address this problem, and it served as the foundation for Apache's open-source Hadoop project. Hadoop relies on a dedicated file system, the Hadoop Distributed File System (HDFS), which is analogous to the Google File System (GFS). HDFS splits massive datasets into equally sized blocks and places them across multiple nodes in a Hadoop cluster [1]. As a result, Hadoop is now widely accepted as a data analytics model [2].

Hadoop's fundamental operating principle is that "moving computation to data is less expensive than moving data to computation." Accordingly, Hadoop tries to schedule tasks on the nodes that hold their input data in order to minimise network traffic [3]. Task scheduling is critical in Hadoop because it significantly affects the framework's computation time and, thus, its overall performance [4]. However, given the dynamic nature of the cloud environment, designing an effective task-scheduling strategy remains a constant challenge. Nevertheless, only a few studies have analysed the proposed techniques and their overall effect on the Hadoop framework's performance.
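To make the block-placement idea concrete, the short Java sketch below lists where HDFS has placed the blocks of a file, using Hadoop's public FileSystem API. The NameNode URI and the file path are hypothetical placeholders; the cluster's block size is governed by the dfs.blocksize setting.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationInspector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with your cluster's URI.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical input file already stored in HDFS.
        Path file = new Path("/data/input/dataset.csv");
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation describes one fixed-size block and the
        // DataNodes holding its replicas.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```

The host lists printed here are exactly what a locality-aware scheduler consults when deciding where to run each map task.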
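The data-locality principle can likewise be illustrated with a minimal sketch. The Java fragment below is not Hadoop's actual scheduler; it is a simplified, hypothetical assigner that, when a node frees a slot, prefers a pending task whose input block already resides on that node and falls back to a remote (non-local) task otherwise.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/**
 * Minimal, hypothetical sketch of locality-first task assignment
 * (not Hadoop's real scheduler code): prefer node-local tasks so the
 * input block never crosses the network.
 */
public class LocalityFirstAssigner {
    record Task(String id, Set<String> hostsWithData) {}

    static Task nextTaskFor(String freeNode, Deque<Task> pending) {
        Iterator<Task> it = pending.iterator();
        while (it.hasNext()) {
            Task t = it.next();
            if (t.hostsWithData().contains(freeNode)) {
                it.remove();
                return t;          // node-local: input needs no network hop
            }
        }
        return pending.poll();     // no local work left: run a remote task
    }

    public static void main(String[] args) {
        Deque<Task> pending = new ArrayDeque<>(List.of(
                new Task("m1", Set.of("node-a", "node-b")),
                new Task("m2", Set.of("node-c"))));
        System.out.println(nextTaskFor("node-c", pending).id()); // m2 (local)
        System.out.println(nextTaskFor("node-c", pending).id()); // m1 (remote)
    }
}
```

Even this toy version shows the trade-off a real scheduler must manage: insisting on locality keeps network traffic low, but falling back to remote tasks keeps otherwise idle slots busy.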