2019
DOI: 10.1002/cpe.5637
Handling data skew at reduce stage in Spark by ReducePartition

Abstract: As a typical representative of distributed computing frameworks, Spark has been continuously developed and popularized. It reduces data transmission time through efficient memory-based operations and addresses the shortcomings of the traditional MapReduce computation model in iterative computation. In Spark, data skew is very prominent due to the uneven distribution of input data and the unbalanced allocation of the default partitioning algorithm. When data skew occurs, the execution efficiency of the prog…
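The skew the abstract describes stems from Spark's default hash partitioning: every record with the same key lands in the same reduce partition, so a few hot keys overload a few reducers. Below is a minimal, self-contained sketch (not code from the paper) that reproduces this on a toy keyed RDD with an artificially hot key and prints per-partition record counts; the data sizes and partition count are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.HashPartitioner

object SkewDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skew-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy keyed data: the key "hot" dominates, mimicking skewed input.
    val records = sc.parallelize(
      Seq.fill(90000)(("hot", 1)) ++ (1 to 10000).map(i => (s"key-$i", 1))
    )

    // Spark's default HashPartitioner routes every occurrence of a key
    // to the same partition, so one reduce task receives most of the data.
    val partitioned = records.partitionBy(new HashPartitioner(8))

    // Per-partition record counts expose the imbalance at the reduce stage.
    val sizes = partitioned
      .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()
    sizes.foreach { case (idx, n) => println(s"partition $idx -> $n records") }

    spark.stop()
  }
}
```

Running this locally, the partition holding "hot" reports roughly 90,000 records while the others share the remainder, which is exactly the straggler pattern that reduce-stage repartitioning schemes such as the paper's aim to eliminate.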

Cited by 4 publications (1 citation statement); references 21 publications. Citing publications appeared in 2021 (2) and 2023 (2).
“…Huang and Wei [38] leveraged a skew detection algorithm to identify skewed partitions and adjusted task resource allocation according to a fine-grained resource allocation algorithm. Guo and Huang [39] took into account the differences in computational capability among the computing nodes and assigned each task to the computing node with the highest performance factor according to a greedy strategy. Li and Zhang [40] established virtual partitions for data partitions holding a huge amount of data, then used the hash partitioning method to further split the data and relieve the computational pressure.…”
Section: Introduction
confidence: 99%
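The virtual-partition idea attributed to Li and Zhang [40] resembles the widely used two-phase "salting" pattern: records of an oversized partition are first spread across several sub-partitions by an added random tag, aggregated in parallel, and then combined. The sketch below is a generic illustration of that pattern under assumed names (`splits`, the toy data), not the algorithm from [40] itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.HashPartitioner
import scala.util.Random

object VirtualPartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("virtual-partition-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Skewed input: one hot key carries most of the records.
    val skewed = sc.parallelize(
      Seq.fill(90000)(("hot", 1)) ++ (1 to 10000).map(i => (s"key-$i", 1))
    )
    val splits = 8 // assumed number of virtual sub-partitions per key

    // Phase 1: attach a random salt to each key so records of a hot key
    // spread across `splits` sub-keys before hash partitioning, then
    // pre-aggregate each sub-key in parallel.
    val salted = skewed
      .map { case (k, v) => ((k, Random.nextInt(splits)), v) }
      .reduceByKey(new HashPartitioner(splits * 4), _ + _)

    // Phase 2: drop the salt and combine the partial aggregates. This
    // final shuffle moves only one record per (key, salt) pair.
    val result = salted
      .map { case ((k, _), v) => (k, v) }
      .reduceByKey(_ + _)

    result.collect().foreach(println)
    spark.stop()
  }
}
```

The trade-off is one extra shuffle in exchange for bounding the largest reduce task, which pays off only when the pre-aggregation in phase 1 shrinks the hot partition substantially.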