2019
DOI: 10.1002/cpe.5637
Handling data skew at reduce stage in Spark by ReducePartition

Abstract: As a typical representative of distributed computing frameworks, Spark has been continuously developed and popularized. It reduces data transmission time through efficient memory-based operations and addresses the shortcomings of the traditional MapReduce computation model in iterative computation. In Spark, data skew is very prominent due to the uneven distribution of input data and the unbalanced allocation of the default partitioning algorithm. When data skew occurs, the execution efficiency of the prog…
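The skew the abstract describes stems from Spark's default hash partitioning: every record with the same key lands in the same reduce partition, so a few hot keys overload a few reducers. Below is a minimal, self-contained sketch (not code from the paper) that reproduces this on a toy keyed RDD with an artificially hot key and prints per-partition record counts; the data sizes and partition count are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.HashPartitioner

object SkewDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skew-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy keyed data: the key "hot" dominates, mimicking skewed input.
    val records = sc.parallelize(
      Seq.fill(90000)(("hot", 1)) ++ (1 to 10000).map(i => (s"key-$i", 1))
    )

    // Spark's default HashPartitioner routes every occurrence of a key
    // to the same partition, so one reduce task receives most of the data.
    val partitioned = records.partitionBy(new HashPartitioner(8))

    // Per-partition record counts expose the imbalance at the reduce stage.
    val sizes = partitioned
      .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()
    sizes.foreach { case (idx, n) => println(s"partition $idx -> $n records") }

    spark.stop()
  }
}
```

Running this locally, the partition holding "hot" reports roughly 90,000 records while the others share the remainder, which is exactly the straggler pattern that reduce-stage repartitioning schemes such as the paper's aim to eliminate.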

Cited by 4 publications (1 citation statement); references 21 publications. Citing publications appeared in 2021 (2) and 2023 (2).
“…Huang and Wei [38] leveraged a skew detection algorithm to identify skewed partitions and adjusted task resource allocation according to a fine-grained resource allocation algorithm. Guo and Huang [39] took into account the differences in computational capability among the computing nodes and assigned each task to the computing node with the highest performance factor according to a greedy strategy. Li and Zhang [40] established virtual partitions for data partitions holding a huge amount of data, then used the hash partitioning method to further split the data and relieve the computational pressure.…”
Section: Introduction
confidence: 99%
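The virtual-partition idea attributed to Li and Zhang [40] resembles the widely used two-phase "salting" pattern: records of an oversized partition are first spread across several sub-partitions by an added random tag, aggregated in parallel, and then combined. The sketch below is a generic illustration of that pattern under assumed names (`splits`, the toy data), not the algorithm from [40] itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.HashPartitioner
import scala.util.Random

object VirtualPartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("virtual-partition-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Skewed input: one hot key carries most of the records.
    val skewed = sc.parallelize(
      Seq.fill(90000)(("hot", 1)) ++ (1 to 10000).map(i => (s"key-$i", 1))
    )
    val splits = 8 // assumed number of virtual sub-partitions per key

    // Phase 1: attach a random salt to each key so records of a hot key
    // spread across `splits` sub-keys before hash partitioning, then
    // pre-aggregate each sub-key in parallel.
    val salted = skewed
      .map { case (k, v) => ((k, Random.nextInt(splits)), v) }
      .reduceByKey(new HashPartitioner(splits * 4), _ + _)

    // Phase 2: drop the salt and combine the partial aggregates. This
    // final shuffle moves only one record per (key, salt) pair.
    val result = salted
      .map { case ((k, _), v) => (k, v) }
      .reduceByKey(_ + _)

    result.collect().foreach(println)
    spark.stop()
  }
}
```

The trade-off is one extra shuffle in exchange for bounding the largest reduce task, which pays off only when the pre-aggregation in phase 1 shrinks the hot partition substantially.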