An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster

Ahmed, N.; Barczak, Andre L. C.; Rashid, M. A.; Sušnjak, Teo

doi:10.3390/bdcc5040065

Cited by 5 publications

(3 citation statements)

References 39 publications


“…This diagram shows the prevalence of economic activities across various economic sectors, shedding light on the dynamics of procurement activities within each division. For the efficient processing and analysis of text descriptors at scale, we employed Apache Spark, a high-performance framework for distributed data processing [16][17][18][19]. The approach involved vectorizing descriptor texts using the Term Frequency-Inverse Document Frequency (TF-IDF) method, which assesses the importance of each word in the context of the entire text corpus.…”

Section: Dataset Description (mentioning)

confidence: 99%

Application of Natural Language Processing and Genetic Algorithm to Fine-Tune Hyperparameters of Classifiers for Economic Activities Analysis

Malashin, Masich, Tynchenko et al. 2024

BDCC


This study proposes a method for classifying economic activity descriptors to match Nomenclature of Economic Activities (NACE) codes, employing a blend of machine learning techniques and expert evaluation. By leveraging natural language processing (NLP) methods to vectorize activity descriptors and utilizing genetic algorithm (GA) optimization to fine-tune hyperparameters in multi-class classifiers like Naive Bayes, Decision Trees, Random Forests, and Multilayer Perceptrons, our aim is to boost the accuracy and reliability of an economic classification system. This system faces challenges due to the absence of precise target labels in the dataset. Hence, it is essential to initially check the accuracy of utilized methods based on expert evaluations using a small dataset before generalizing to a larger one.
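The TF-IDF vectorization mentioned in the citation statement for this paper can be illustrated with a minimal sketch. This is plain Python rather than the Spark pipeline the authors used, and it assumes the common unsmoothed weighting, term frequency times log(N / document frequency); the cited study may use a different variant. The function name `tf_idf` and the sample descriptors are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting (not taken from the cited paper):
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical activity descriptors, tokenized on whitespace.
docs = [
    "supply of office paper".split(),
    "supply of medical equipment".split(),
    "repair of medical equipment".split(),
]
weights = tf_idf(docs)
```

A term that occurs in every descriptor, such as "of" here, receives weight zero, while a term unique to one descriptor, such as "paper", receives the highest weight in that descriptor.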


“…Matteussi et al [17] provided a comprehensive performance evaluation of Spark Streaming backpressure. Ahmed et al [18] proposed two different parallelization models for performance prediction. Zhu et al in [19] proposed a model to capture the execution behavior of tasks, phases, and jobs, and based on the model, implemented a prototype system.…”

Section: Related Work (mentioning)

confidence: 99%

“…Upon computing the dot-product attention scores, the third step involves leveraging these scores to compute the weighted sum of features for each data point, yielding the novel feature representation for each head. For head l (where l ∈ {1, 2, 3, 4}) and data point i, the new feature representation Q li can be acquired using Equation (18).…”

mentioning

confidence: 99%

A Novel Multi-Task Performance Prediction Model for Spark

Shen, Chen, Rao 2023

Applied Sciences


Performance prediction of Spark plays a vital role in cluster resource management and system efficiency improvement. The performance of Spark is affected by several variables, such as the size of the input data, the computational power of the system, and the complexity of the algorithm. At the same time, less research has focused on multi-task performance prediction models for Spark. To address these challenges, we propose a multi-task Spark performance prediction model. The model integrates a multi-head attention mechanism and a convolutional neural network. It implements the prediction of execution times for single or multiple Spark applications. Firstly, the data are dimensionally reduced by a dimensionality reduction algorithm and fed into the model. Secondly, the model integrates a multi-head attention mechanism and a convolutional neural network. It captures complex relationships between data features and uses these features for Spark performance prediction. Finally, we use residual connections to prevent overfitting. To validate the performance of the model, we conducted experiments on four Spark benchmark applications. Compared to the benchmark prediction model, our model obtains better performance metrics. In addition, our model predicts multiple Spark benchmark applications simultaneously and maintains deviations within permissible limits. It provides a novel way for the assessment and optimization of Spark.


Mjolnir: A framework agnostic auto-tuning system with deep reinforcement learning

Slimane, Sagaama, Marwani et al. 2022

Appl Intell
