2021
DOI: 10.3390/bdcc5040065

An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster

Abstract: Big data frameworks play a vital role in storing, processing, and analysing large datasets. Apache Spark has been established as one of the most popular big data engines for its efficiency and reliability. However, one of the significant problems of the Spark system is performance prediction. Spark has more than 150 configurable parameters, and with so many parameters, determining suitable settings for the system is a challenging task. In this paper, we proposed two distinct parallelisation…

Citations: cited by 5 publications (3 citation statements)
References: 39 publications
“…This diagram shows the prevalence of economic activities across various economic sectors, shedding light on the dynamics of procurement activities within each division. For the efficient processing and analysis of text descriptors at scale, we employed Apache Spark, a high-performance framework for distributed data processing [16][17][18][19]. The approach involved vectorizing descriptor texts using the Term Frequency-Inverse Document Frequency (TF-IDF) method, which assesses the importance of each word in the context of the entire text corpus.…”
Section: Dataset Description
mentioning
confidence: 99%
“…Matteussi et al [17] provided a comprehensive performance evaluation of Spark Streaming backpressure. Ahmed et al [18] proposed two different parallelization models for performance prediction. Zhu et al in [19] proposed a model to capture the execution behavior of tasks, phases, and jobs, and based on the model, implemented a prototype system.…”
Section: Related Work
mentioning
confidence: 99%
“…Upon computing the dot-product attention scores, the third step involves leveraging these scores to compute the weighted sum of features for each data point, yielding the novel feature representation for each head. For head l (where l ∈ {1, 2, 3, 4}) and data point i, the new feature representation Q li can be acquired using Equation (18).…”
mentioning
confidence: 99%