A virtual machine based task scheduling approach to improving data locality for virtualized Hadoop

Sun, Ruiqi; Yang, Jie; Gao, Zhan; He, Zhiqiang

doi:10.1109/icis.2014.6912150

Cited by 6 publications

(4 citation statements)

References 21 publications


“…Nonetheless, legacy improvements of data locality in virtualized Hadoop employ two levels of distribution of data (VM level and physical node level) which is not effective. DSFvH (Sun et al, 2014) presented a flexible virtualized Hadoop system in which storage and computing nodes are placed in their respective VMs. The DSFvH task scheduling algorithm aims to improve data locality by migrating the computing VMs to the physical node hosting the storage VM, which holds the data replica for the scheduled task.…”

Section: Cloud-based Schedulers (mentioning)

confidence: 99%
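The scheduling idea in the statement above (move a computing VM to the physical node that hosts a storage VM holding the task's data replica) can be illustrated with a minimal sketch. This is not the DSFvH implementation: the `Cluster` class, its fields, the `schedule` function, and the tie-breaking by name are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical cluster model (assumed for illustration only)."""
    storage_host: dict   # storage VM name -> physical host name
    compute_host: dict   # computing VM name -> physical host name
    replicas: dict       # storage VM name -> set of block ids it holds
    busy: set = field(default_factory=set)  # computing VMs already running a task


def schedule(cluster, block_id):
    """Pick a computing VM for a task that reads block_id.

    Returns (vm, None) if an idle computing VM is already co-located with a
    replica, (vm, target_host) if a VM was migrated to a replica's host, or
    None if no computing VM is idle.
    """
    # Physical hosts whose storage VMs hold a replica of the block.
    hosts = sorted({cluster.storage_host[s]
                    for s, blocks in cluster.replicas.items()
                    if block_id in blocks})
    if not hosts:
        raise ValueError(f"no replica found for block {block_id!r}")

    idle = sorted(vm for vm in cluster.compute_host if vm not in cluster.busy)
    if not idle:
        return None

    # Step 1: prefer an idle computing VM already on a replica's host.
    for vm in idle:
        if cluster.compute_host[vm] in hosts:
            cluster.busy.add(vm)
            return vm, None

    # Step 2: otherwise "migrate" an idle computing VM to a replica's host.
    vm, target = idle[0], hosts[0]
    cluster.compute_host[vm] = target
    cluster.busy.add(vm)
    return vm, target


cluster = Cluster(
    storage_host={"s1": "h1", "s2": "h2"},
    compute_host={"c1": "h1", "c2": "h3"},
    replicas={"s1": {"b1"}, "s2": {"b2"}},
)
first = schedule(cluster, "b1")   # c1 is already on h1, no migration
second = schedule(cluster, "b2")  # c2 is moved from h3 to h2
third = schedule(cluster, "b1")   # no idle computing VM is left
```

A real scheduler would also have to weigh migration cost against the cost of a remote read; the sketch ignores that trade-off.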

Jargon of Hadoop MapReduce scheduling techniques: a scientific categorization

Hanif

Lee

2019

The Knowledge Engineering Review


Recently, valuable knowledge that can be retrieved from a huge volume of datasets (called Big Data) set in motion the development of frameworks to process data based on parallel and distributed computing, including Apache Hadoop, Facebook Corona, and Microsoft Dryad. Apache Hadoop is an open source implementation of Google MapReduce that attracted strong attention from the research community both in academia and industry. Hadoop MapReduce scheduling algorithms play a critical role in the management of large commodity clusters, controlling QoS requirements by supervising users, jobs, and tasks execution. Hadoop MapReduce comprises three schedulers: FIFO, Fair, and Capacity. However, the research community has developed new optimizations to consider advances and dynamic changes in hardware and operating environments. Numerous efforts have been made in the literature to address issues of network congestion, straggling, data locality, heterogeneity, resource under-utilization, and skew mitigation in Hadoop scheduling. Recently, the volume of research published in journals and conferences about Hadoop scheduling has consistently increased, which makes it difficult for researchers to grasp the overall view of research and areas that require further investigation. A scientific literature review has been conducted in this study to assess preceding research contributions to the Apache Hadoop scheduling mechanism. We classify and quantify the main issues addressed in the literature based on their jargon and areas addressed. Moreover, we explain and discuss the various challenges and open issue aspects in Hadoop scheduling optimizations.


“…In this manner, CNs can be migrated to a "suitable" place based on the idea of "mobile computing". It is obvious that this deployment form offers several advantages over a centralized method [15]: (1) strong scalability, which allows for respective fluctuating numbers of CNs or SNs and (2) flexible migration, i.e., CNs can be migrated without considering any other SNs.…”

Section: Dynamic Migration Based Data Locality (mentioning)

confidence: 99%

“…Compared to that of the traditional Hadoop cluster, data locality in the DHCI architecture can be classified into three categories [16], as illustrated in Fig. 4.…”

Section: Dynamic Migration Based Data Locality (mentioning)

confidence: 99%


Load feedback-based resource scheduling and dynamic migration-based data locality for virtual hadoop clusters in openstack-based clouds

Tao

Lin

Wang

2017

Tsinghua Sci. Technol.


With cloud computing technology becoming more mature, it is essential to combine the big data processing tool Hadoop with the Infrastructure as a Service (IaaS) cloud platform. In this study, we first propose a new Dynamic Hadoop Cluster on IaaS (DHCI) architecture, which includes four key modules: monitoring, scheduling, Virtual Machine (VM) management, and VM migration modules. The load of both physical hosts and VMs is collected by the monitoring module and can be used to design resource scheduling and data locality solutions. Second, we present a simple load feedback-based resource scheduling scheme. The resource allocation can be avoided on overburdened physical hosts or the strong scalability of virtual cluster can be achieved by fluctuating the number of VMs. To improve the flexibility, we adopt the separated deployment of the computation and storage VMs in the DHCI architecture, which negatively impacts the data locality. Third, we reuse the method of VM migration and propose a dynamic migration-based data locality scheme using parallel computing entropy. We migrate the computation nodes to different host(s) or rack(s) where the corresponding storage nodes are deployed to satisfy the requirement of data locality. We evaluate our solutions in a realistic scenario based on OpenStack. Substantial experimental results demonstrate the effectiveness of our solutions that contribute to balance the workload and performance improvement, even under heavy-loaded cloud system conditions.
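The load feedback idea in this abstract (avoid allocating resources on overburdened physical hosts, using loads reported by a monitoring module) can be sketched as follows. The function name `pick_host`, the load scale from 0 to 1, and the 0.8 threshold are assumptions made for illustration; the abstract does not specify the actual metric or threshold, and the entropy-based migration scheme is not modelled here.

```python
def pick_host(loads, threshold=0.8):
    """Choose a physical host for a new VM from reported loads.

    loads: dict mapping host name -> load in [0, 1] (assumed scale).
    threshold: hosts at or above this load count as overburdened (assumed value).
    Returns the least-loaded eligible host, or None if every host is overburdened.
    """
    eligible = {host: load for host, load in loads.items() if load < threshold}
    if not eligible:
        return None
    # Sort names first so that ties on load are broken deterministically.
    return min(sorted(eligible), key=eligible.get)


chosen = pick_host({"h1": 0.9, "h2": 0.4, "h3": 0.4})
saturated = pick_host({"h1": 0.95, "h2": 0.85})
```

Returning `None` when all hosts are overburdened leaves the decision (wait, scale out, or reject) to the caller, which keeps the placement rule itself simple.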


Scheduling Data-Intensive Workloads in Large-Scale Distributed Systems: Trends and Challenges

Stavrinides

Karatza

2018

Studies in Big Data

