2020
DOI: 10.1186/s40537-020-00319-4
Estimating runtime of a job in Hadoop MapReduce

Abstract: Nowadays, with the emergence and use of new systems, we face a massive amount of data. Due to the volume, velocity, and variety of these big data, managing, maintaining, and processing them require special infrastructures. One of the best-known open-source frameworks is Apache Hadoop [1], a scalable and reliable framework for storing and processing big data. Hadoop divides the large input data into fixed-size pieces, then stores and processes these splits of data on a cluster of machines. By default, each split copies …
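The splitting behavior the abstract describes can be sketched as follows. This is a minimal illustration, not Hadoop API code: the 128 MB split size and replication factor of 3 are Hadoop's common defaults, and the function names are assumptions made for this sketch.

```python
# Illustrative sketch of Hadoop's input splitting and replication.
# DEFAULT_SPLIT_SIZE and DEFAULT_REPLICATION reflect common Hadoop
# defaults; the function names are not part of any Hadoop API.
import math

DEFAULT_SPLIT_SIZE = 128 * 1024 * 1024   # 128 MB, the default HDFS block/split size
DEFAULT_REPLICATION = 3                  # the default HDFS replication factor


def num_splits(input_bytes, split_size=DEFAULT_SPLIT_SIZE):
    """Number of fixed-size splits a job's input is divided into."""
    return math.ceil(input_bytes / split_size)


def stored_bytes(input_bytes, replication=DEFAULT_REPLICATION):
    """Total bytes stored in the cluster once every split is replicated."""
    return input_bytes * replication


one_gib = 1024 ** 3
print(num_splits(one_gib))      # 8 splits of 128 MB each
print(stored_bytes(one_gib))    # 3 GiB stored across the cluster
```

Under these defaults, a 1 GiB input yields 8 splits, and replication triples the storage footprint; the actual values are configurable per cluster.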

Cited by 12 publications (2 citation statements)
References 15 publications
“…This work considers using ML oracle to predict job sizes. Recent works (Amiri and Mohammad-Khanli 2017;Peyravi and Moeini 2020;Yamashiro and Nonaka 2021) have shown that job sizes are highly predictable in many scenarios, e.g., cloud, clusters, and factories. In addition, when the prediction is accurate, we have the 2-relaxed decision procedure guaranteeing a near-optimal makespan, and when the prediction goes arbitrarily bad, the existing O(log m)-competitive algorithm can bound the performance.…”
Section: Oracle and Prediction Error
confidence: 99%
“…Zhu et al. [16] proposed BestConfig, which uses the divide-and-diverge sampling method and the recursive-bound-and-search method for parameter tuning of general systems with resource constraints. Peyravi et al. [17] estimated the runtime of a MapReduce job by considering three categories of parameters that have a higher impact on the runtime. They modeled the runtime of each phase of the Hadoop execution pipeline using a weighting system based on job history.…”
Section: Related Work
confidence: 99%
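The per-phase weighting idea attributed to Peyravi et al. above could be sketched as follows. This is an assumption-laden illustration, not the paper's exact model: the phase names, the use of a simple historical average per phase, and the fixed weights are all placeholders for whatever the paper actually derives from job history.

```python
# Illustrative sketch (not the cited paper's exact model): estimate a
# job's runtime as a weighted sum of average per-phase runtimes taken
# from the job's execution history.

def estimate_runtime(history, weights):
    """Estimate total runtime in seconds.

    history: list of dicts mapping phase name -> observed runtime (s)
             for past runs of similar jobs.
    weights: dict mapping phase name -> weight (assumed to sum to 1).
    """
    estimate = 0.0
    for phase, w in weights.items():
        runs = [job[phase] for job in history if phase in job]
        avg = sum(runs) / len(runs) if runs else 0.0
        estimate += w * avg
    return estimate


# Hypothetical job history: two past runs with map/shuffle/reduce times.
history = [
    {"map": 120.0, "shuffle": 40.0, "reduce": 60.0},
    {"map": 100.0, "shuffle": 60.0, "reduce": 80.0},
]
# Hypothetical phase weights; a history-derived scheme would learn these.
weights = {"map": 0.5, "shuffle": 0.2, "reduce": 0.3}
print(estimate_runtime(history, weights))  # 86.0
```

Here the phase averages are 110 s (map), 50 s (shuffle), and 70 s (reduce), so the weighted estimate is 0.5·110 + 0.2·50 + 0.3·70 = 86 s.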