Hani Al-Sayeh scite author profile

Gray Box Modeling Methodology for Runtime Prediction of Apache Spark Jobs

Al-Sayeh¹, Sattler² 2019


Apache Spark jobs are often characterized by processing huge data sets and, therefore, require runtimes in the range of minutes to hours. Thus, being able to predict the runtime of such jobs would be useful not only to know when the job will finish, but also for scheduling purposes, to estimate monetary costs for cloud deployment, or to determine an appropriate cluster configuration, such as the number of nodes. However, predicting Spark job runtimes is much more challenging than for standard database queries: cluster configuration and parameters have a significant performance impact, and jobs usually contain a lot of user-defined code, making it difficult to estimate cardinalities and execution costs. In this paper, we present a gray-box modeling methodology for runtime prediction of Apache Spark jobs. Our approach comprises two steps. First, a white-box model for predicting the cardinalities of the input RDDs of each operator is built based on prior knowledge about the behavior and application parameters, such as the filters applied to the data and the number of iterations. Second, a black-box model is used for each task, constructed by monitoring runtime metrics while varying the allocated resources and input RDD cardinalities. We further show how to use this gray-box approach not only for predicting the runtime of a given job, but also as part of a decision model for reusing intermediate cached results of Spark jobs. Our methodology is validated with an experimental evaluation showing a highly accurate prediction of the actual job runtime and a performance improvement if intermediate results can be reused.
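The two-step idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Operator` class, the per-operator `selectivity` parameter, and the linear model `runtime ≈ a * rows / cores + b` are all assumptions chosen only to make the white-box and black-box roles concrete.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    selectivity: float  # assumed: fraction of input rows passed on to the next operator


def predict_cardinalities(input_rows, operators):
    """White-box step (assumed form): propagate row counts through the operator chain."""
    cardinalities = []
    rows = input_rows
    for op in operators:
        cardinalities.append(rows)  # input cardinality seen by this operator
        rows = rows * op.selectivity
    return cardinalities


def fit_black_box(samples):
    """Black-box step (assumed form): least-squares fit of seconds = a * rows / cores + b.

    `samples` is a list of monitored (rows, cores, seconds) tuples.
    """
    xs = [rows / cores for rows, cores, _ in samples]
    ys = [seconds for _, _, seconds in samples]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda rows, cores: a * rows / cores + b


def predict_job_runtime(input_rows, operators, models, cores):
    """Gray-box prediction: feed white-box cardinalities into the black-box models."""
    cardinalities = predict_cardinalities(input_rows, operators)
    return sum(models[op.name](rows, cores) for op, rows in zip(operators, cardinalities))


# Hypothetical monitoring data: (rows, cores, seconds)
samples = [(1000, 1, 3.0), (2000, 1, 5.0), (2000, 2, 3.0)]
model = fit_black_box(samples)
operators = [Operator("filter", 0.5), Operator("map", 1.0)]
models = {"filter": model, "map": model}
estimate = predict_job_runtime(1000, operators, models, cores=2)
```

Here the white-box part uses only knowledge about the job (the selectivities), while the black-box part is fitted purely from measurements; a real model would likely need more features than rows per core.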


Juggler: Autonomous Cost Optimization and Performance Prediction of Big Data Applications

Al-Sayeh¹, Memishi², Jibril³ et al. 2022


Blink: Lightweight Sample Runs for Cost Optimization of Big Data Applications

Al-Sayeh¹, Jibril², Memishi³ et al. 2022

Preprint


Distributed in-memory data processing engines accelerate iterative applications by caching substantial datasets in memory rather than recomputing them in each iteration. Selecting a suitable cluster size for caching these datasets plays an essential role in achieving optimal performance. In practice, this is a tedious and hard task for end users, who are typically not aware of cluster specifications, workload semantics, and sizes of intermediate data. We present Blink, an autonomous sampling-based framework that predicts the sizes of cached datasets and selects the optimal cluster size without relying on historical runs. We evaluate Blink on a variety of iterative, real-world machine learning applications. With an average sample-run cost of 4.6 % of the cost of optimal runs, Blink selects the optimal cluster size in 15 out of 16 cases, saving up to 47.4 % of execution cost compared to average costs.
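As a rough illustration of sampling-based sizing, the following Python sketch extrapolates the cached dataset size from a few small sample runs and then picks the smallest cluster whose cache memory fits it. This is not Blink's actual algorithm: the linear extrapolation, the function names, the sample measurements, and the per-machine cache capacity are all hypothetical.

```python
import math


def extrapolate_cached_size(measurements):
    """Fit size = slope * fraction + intercept by least squares and evaluate at fraction 1.0.

    `measurements` is a list of (sample_fraction, cached_size_gb) pairs from sample runs.
    Linear growth of cached size with input size is an assumption of this sketch.
    """
    xs = [fraction for fraction, _ in measurements]
    ys = [size for _, size in measurements]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope * 1.0 + intercept


def select_cluster_size(cached_size_gb, cache_gb_per_machine):
    """Smallest machine count whose combined cache memory holds the whole dataset."""
    return max(1, math.ceil(cached_size_gb / cache_gb_per_machine))


# Hypothetical sample runs on 1 %, 2 %, and 3 % of the input data
measurements = [(0.01, 0.12), (0.02, 0.22), (0.03, 0.32)]
full_size_gb = extrapolate_cached_size(measurements)
machines = select_cluster_size(full_size_gb, cache_gb_per_machine=4.0)
```

The design point the sketch tries to capture is that the sample runs are cheap relative to a full run, and that the cluster is sized so the cached data does not have to be recomputed or evicted.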


A gray-box modeling methodology for runtime prediction of Apache Spark jobs

Al-Sayeh¹, Hagedorn², Sattler³ 2020

Distrib Parallel Databases


Apache Spark jobs are often characterized by processing huge data sets and, therefore, require runtimes in the range of minutes to hours. Thus, being able to predict the runtime of such jobs would be useful not only to know when the job will finish, but also for scheduling purposes, to estimate monetary costs for cloud deployment, or to determine an appropriate cluster configuration, such as the number of nodes. However, predicting Spark job runtimes is much more challenging than for standard database queries: cluster configuration and parameters have a significant performance impact, and jobs usually contain a lot of user-defined code, making it difficult to estimate cardinalities and execution costs. In this paper, we present a gray-box modeling methodology for runtime prediction of Apache Spark jobs. Our approach comprises two steps. First, a white-box model for predicting the cardinalities of the input RDDs of each operator is built based on prior knowledge about the behavior and application parameters, such as the filters applied to the data and the number of iterations. Second, a black-box model is used for each task, constructed by monitoring runtime metrics while varying the allocated resources and input RDD cardinalities. We further show how to use this gray-box approach not only for predicting the runtime of a given job, but also as part of a decision model for reusing intermediate cached results of Spark jobs. Our methodology is validated with an experimental evaluation showing a highly accurate prediction of the actual job runtime and a performance improvement if intermediate results can be reused.


Blink: Lightweight Sample Runs for Cost Optimization of Big Data Applications

Al-Sayeh¹, Jibril², Memishi³ et al. 2022


