Machine Learning for Performance Prediction of Spark Cloud Applications

Maros, Alexandre; Murai, Fabrício; Silva, Ana Paula Couto da; Almeida, Jussara M.; Lattuada, Marco; Gianniti, Eugenio; Hosseini, Marjan; Ardagna, Danilo

doi:10.1109/cloud.2019.00028

Cited by 27 publications

(21 citation statements)

References 14 publications

“…Maros [22] conducted a cost-benefit analysis of a supervised machine learning model for Spark performance prediction and compared their results with Ernest [23]. In this investigation, they considered the black box and gray box techniques.…”

Section: Related Work (mentioning)

confidence: 99%

A parallelization model for performance characterization of Spark Big Data jobs on Hadoop clusters

et al. 2021

This article proposes a new parallel performance model for different workloads of Spark Big Data applications running on Hadoop clusters. The proposed model can predict the runtime for generic workloads as a function of the number of executors, without necessarily knowing how the algorithms were implemented. For a certain problem size, it is shown that a model based on serial boundaries for a 2D arrangement of executors can fit the empirical data for various workloads. The empirical data was obtained from a real Hadoop cluster, using Spark and HiBench. The workloads used in this work included WordCount, SVM, Kmeans, PageRank and Graph (Nweight). A particular runtime pattern emerged when adding more executors to run a job: for some workloads, the runtime was longer with more executors added. This phenomenon is predicted by the new parallelisation model. The resulting equation explains certain performance patterns that fit neither Amdahl’s law nor Gustafson’s equation. The results show that the proposed model achieved the best fit for all workloads and most of the data sizes, using the R-squared metric to measure how well the model fits the empirical data. The proposed model has advantages over machine learning models due to its simplicity, requiring a smaller number of experiments to fit the data. This is very useful to practitioners in the area of Big Data because they can predict the runtime of specific applications by analysing the logs. In this work, the model is limited to changes in the number of executors for a fixed problem size.
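To make the baselines named in this abstract concrete, the following is a minimal Python sketch of Amdahl's law, Gustafson's scaled speedup, and the R-squared fit metric. It does not reproduce the article's own 2D-executor model, whose equation is not given here. The function names, the parallel fraction `p`, and the sample runtimes are illustrative assumptions, not measurements from the article.

```python
def amdahl_runtime(t1, p, n):
    """Predicted runtime on n executors under Amdahl's law.

    t1: runtime on one executor; p: parallelisable fraction (0..1).
    """
    return t1 * ((1.0 - p) + p / n)


def gustafson_speedup(p, n):
    """Scaled speedup on n executors under Gustafson's law."""
    return (1.0 - p) + p * n


def r_squared(observed, predicted):
    """Coefficient of determination between observed and predicted values."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - q) ** 2 for o, q in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical runtimes (seconds) that grow again at high executor counts.
executors = [1, 2, 4, 8, 16]
observed = [100.0, 60.0, 45.0, 50.0, 62.0]

# Amdahl predictions with an assumed parallel fraction of 0.7.
predicted = [amdahl_runtime(100.0, 0.7, n) for n in executors]

if __name__ == "__main__":
    print("Amdahl predictions:", predicted)
    print("R-squared:", r_squared(observed, predicted))
```

Because Amdahl's prediction never increases as executors are added, it cannot follow the upturn in the hypothetical data. This is the kind of pattern the abstract says neither classical law fits.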

“…The tools will automate the AI application performance profiling and identify the ML model providing the highest performance prediction accuracy supporting model selection and hyper-parameters tuning. Preliminary results in [4,5,6] have shown that ML models allow to achieve good accuracy (with average percentage error between 5 and 15%) in cloud environments. AI-SPRINT will extend the use of such models to consider AI-based sensors and deep networks partitioned and deployed across computing continua.…”

Section: Performance Models (mentioning)

confidence: 99%

“…Both private and public clouds will be valid targets. Applications will be described as OASIS TOSCA templates describing the topology of their components and their software dependencies. Application templates will support restrictions for the deployment on multiple heterogeneous resources (i.e., including hardware accelerators, e.g., GPGPUs) also at the edge layer, by specific attributes (e.g., the target image, performance constraints or privacy requirements), which will be instantiated and managed by a single interaction with the AI-SPRINT framework.…”

Section: Continuous Deployment (mentioning)

confidence: 99%

Advancing Design and Runtime Management of AI Applications with AI-SPRINT (Position Paper)

Sedghani, Ardagna, Matteucci, et al. 2021

2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC)

Self Cite

The adoption of artificial intelligence (AI) technologies is steadily increasing. However, to become fully pervasive, AI needs resources at the edge of the network. The cloud can provide the processing power needed for big data, but edge computing is close to where data are produced and therefore crucial to their timely, flexible, and secure management. In this paper, we introduce the AI-SPRINT "Artificial intelligence in Secure PRIvacy-preserving computing coNTinuum" project, which will provide solutions to seamlessly design, partition, and run AI applications in computing continuum environments. AI-SPRINT will offer novel tools for AI application development, secure execution, easy deployment, as well as runtime management and optimization. AI-SPRINT design tools will allow trading off application performance (in terms of end-to-end latency or throughput), energy efficiency, and AI model accuracy while providing security and privacy guarantees. The runtime environment will support live data protection, architecture enhancement, agile delivery, runtime optimization, and continuous adaptation.

“…A cost-benefit Spark performance prediction model based on a machine learning algorithm was proposed by Maros [30]. They have proposed both black-box and grey-box models based on four machine learning algorithms.…”

Section: Related Work (mentioning)

confidence: 99%

An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster

Ahmed, Barczak, Rashid, et al. 2021

BDCC

Big data frameworks play a vital role in storing, processing, and analysing large datasets. Apache Spark has been established as one of the most popular big data engines for its efficiency and reliability. However, one of the significant problems of the Spark system is performance prediction. Spark has more than 150 configurable parameters, and choosing suitable values for so many parameters is a challenging task. In this paper, we propose two distinct parallelisation models for performance prediction. Our insight is that each node in a Hadoop cluster can communicate with identical nodes, and a certain function of the non-parallelisable runtime can be estimated accordingly. Both models use simple equations that allow us to predict the runtime when the size of the job and the number of executables are known. The proposed models were evaluated on five HiBench workloads: Kmeans, PageRank, Graph (NWeight), SVM, and WordCount. Each workload’s empirical data were fitted with one of the two models, meeting the accuracy requirements. Finally, the experimental findings show that the model can be a handy and helpful tool for scheduling and planning system deployment.
