The Partition Cost Model for Load Balancing in MapReduce

Gufler, Benjamin; Augsten, Nikolaus; Reiser, Angelika; Kemper, Alfons

doi:10.1007/978-1-4614-2326-3_20

Cited by 8 publications

(5 citation statements)

References 13 publications


“…Most existing works [2][3][4][20,21] only target the partitioning skew and neglect the computational skew that can arise in both the map and reduce stages. Moreover, a common approach is adopted to these solutions that predicts and then redistributes the task load to achieve a better balance, which requires additional (sometimes heavy) overhead in terms of key distribution sampling and load reassignment.…”

Section: Resource Management In Hadoop Yarn (mentioning)

confidence: 99%

“…The CPU resources allocated to a task are determined by the number of vCores allocated to the task. Memory allocation, by contrast, is controlled by two configurations: Logical RAM limit and maximum JVM heap size limit. The former is a unit used to manage the resources logically, while the latter setting reflects the maximum heap size of the JVM that runs the task.…”

Section: Impact Of Resources On Task Running Time (mentioning)

confidence: 99%

“…The majority of these adopt a common approach that estimates the distribution of the intermediate key-value pairs and then reassigns these key-value pairs to tasks. However, for the purpose of predicting the key-value distribution, this will cause a synchronization barrier, as it requires either waiting for all map tasks to be completed [2,3] or adding a sampling procedure prior to the actual job beginning [4][5][6][7]. Some other approaches [8,9] have speculatively launched replica tasks for lagging tasks with the expectation that the former will be completed faster than the latter.…”

Section: Introduction (mentioning)

confidence: 99%
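The "estimate the key distribution, then reassign keys to tasks" approach described in the statement above can be illustrated with a minimal sketch. It is not the method of any of the cited papers: the function names, the CRC32-based hash, and the greedy largest-first assignment are all assumptions chosen for illustration, and the key counts are made up.

```python
import zlib
from collections import Counter


def hash_partition_loads(key_counts, num_reducers):
    """Load per reducer when each key goes to hash(key) % num_reducers."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        # zlib.crc32 is used instead of hash() so results are stable across runs.
        loads[zlib.crc32(key.encode("utf-8")) % num_reducers] += count
    return loads


def greedy_partition_loads(key_counts, num_reducers):
    """Load per reducer when keys are assigned largest-first to the
    currently least-loaded reducer (a hypothetical rebalancing step)."""
    loads = [0] * num_reducers
    # Sort by descending count, then by key, so ties are broken deterministically.
    for key, count in sorted(key_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        target = loads.index(min(loads))
        loads[target] += count
    return loads


# Illustrative (made-up) estimate of the intermediate key distribution.
key_counts = Counter({"a": 100, "b": 10, "c": 10, "d": 10, "e": 10, "f": 10})

hash_loads = hash_partition_loads(key_counts, 3)
greedy_loads = greedy_partition_loads(key_counts, 3)
print("hash:  ", hash_loads)
print("greedy:", greedy_loads)
```

With these counts the greedy assignment yields loads of `[100, 30, 20]`. The single hot key `a` still dominates one reducer, which shows why reassignment alone cannot remove skew caused by one very frequent key, and why the key counts must be known (or sampled) before the assignment can be computed.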


Run-Time Dynamic Resource Adjustment for Mitigating Skew in MapReduce

Liu, Zhang, Liu et al. 2021

Computer Modeling in Engineering & Sciences


MapReduce is a widely used programming model for large-scale data processing. However, it still suffers from the skew problem, which refers to the case in which load is imbalanced among tasks. This problem can cause a small number of tasks to consume much more time than other tasks, thereby prolonging the total job completion time. Existing solutions to this problem commonly predict the loads of tasks and then rebalance the load among them. However, solutions of this kind often incur high performance overhead due to the load prediction and rebalancing. Moreover, existing solutions target the partitioning skew for reduce tasks, but cannot mitigate the computational skew for map tasks. Accordingly, in this paper, we present DynamicAdjust, a run-time dynamic resource adjustment technique for mitigating skew. Rather than rebalancing the load among tasks, DynamicAdjust monitors the runtime execution of tasks and dynamically increases resources for those tasks that require more computation. In so doing, DynamicAdjust can not only eliminate the overhead incurred by load prediction and rebalancing, but also culls both the partitioning skew and the computational skew. Experiments are conducted based on a 21-node real cluster using real-world datasets. The results show that DynamicAdjust can mitigate the negative impact of the skew and shorten the job completion time by up to 40.85%.


“…Skew in MapReduce. The first related solutions mitigate reducer skew by measuring the key distribution during the Map operation [33,13]. In general, these methods are not appropriate for long-running streaming tasks with concept drifts in key distribution, nor for stateful operators that require state migration after repartitioning.…”

Section: Related Work (mentioning)

confidence: 99%

System-aware dynamic partitioning for batch and streaming workloads

Zvara, Szabó, Lóránt et al. 2021

Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing


When processing data streams with highly skewed and nonstationary key distributions, we often observe overloaded partitions when the hash partitioning fails to balance data correctly. To avoid slow tasks that delay the completion of the whole stage of computation, it is necessary to apply adaptive, on-the-fly partitioning that continuously recomputes an optimal partitioner, given the observed key distribution. While such solutions exist for batch processing of static data sets and stateless stream processing, the task is difficult for long-running stateful streaming jobs where key distribution changes over time. Careful checkpointing and operator state migration is necessary to change the partitioning while the operation is running. Our key result is a lightweight on-the-fly Dynamic Repartitioning (DR) module for distributed data processing systems (DDPS), including Apache Spark and Flink, which improves the performance with negligible overhead. DR can adaptively repartition data during execution using our Key Isolator Partitioner (KIP). In our experiments with real workloads and power-law distributions, we reach a speedup of 1.5-6 for a variety of Spark and Flink jobs.
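The abstract names a Key Isolator Partitioner but does not describe how it works. The sketch below shows only the general idea of isolating heavy keys, under stated assumptions: frequent keys observed in a sample get dedicated partitions and all other keys are hashed over the remaining ones. The function name, the `heavy_fraction` threshold, and the CRC32 hash are hypothetical and are not taken from the paper.

```python
import zlib
from collections import Counter


def build_isolating_partitioner(sample_keys, num_partitions, heavy_fraction=0.1):
    """Return a function key -> partition index.

    Keys whose share of the sample is at least heavy_fraction each get a
    dedicated partition (at most num_partitions - 1 of them); all other
    keys are hashed over the remaining shared partitions.
    """
    counts = Counter(sample_keys)
    total = sum(counts.values())
    # Most frequent first; ties broken by key for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    heavy = [k for k, c in ranked if c / total >= heavy_fraction]
    heavy = heavy[: num_partitions - 1]  # always leave one shared partition
    dedicated = {key: index for index, key in enumerate(heavy)}
    shared = num_partitions - len(dedicated)

    def partition(key):
        if key in dedicated:
            return dedicated[key]
        # Light keys share the partitions that follow the dedicated ones.
        return len(dedicated) + zlib.crc32(key.encode("utf-8")) % shared

    return partition


# Illustrative (made-up) sample: one hot key and twenty rare keys.
sample = ["hot"] * 80 + [f"k{i}" for i in range(20)]
partition = build_isolating_partitioner(sample, num_partitions=4)
print(partition("hot"), partition("k3"))
```

Here `hot` is routed to partition 0 on its own, while the rare keys land in partitions 1 to 3. In a streaming setting the sample would have to be refreshed as the key distribution drifts, which is where the state migration mentioned in the abstract becomes necessary.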


“…End Figure 2: Flow chart of short reads gene sequence parallel alignment. Partitioner is a means of data distribution provided by Hadoop platform [10]. The partition classes built in the platform, such as HashPartitioner and BinaryPartitioner, are not suitable for the distribution of pair-end sequences.…”

Section: Algorithm 1 the Map Algorithm (mentioning)

confidence: 99%

Gene Sequences Parallel Alignment Model Based on Multiple Inputs and Outputs

Feng, Gao 2019

INT J COMPUT COMMUN


Bioinformatics computing is a kind of big data processing problem, which usually has the characteristics of large data scale, large computational load and long computational time. Therefore, the use of big data technology in bioinformatics computing has gradually become a research hotspot, and using Hadoop for gene sequence alignment is one of it. It is a common way to use various tools to complete a job in the field of Biocomputing. In most studies of parallel alignment of gene sequences using Hadoop, third-party tools are also needed. However, there are few methods using Hadoop independently to complete gene sequences alignment. Adding data processing with other tools to Hadoop workflow not only affects the improvement of computing performance, but also complicates the application. In this paper, a parallel alignment model of gene sequences based on multiple inputs and outputs is proposed, which can independently complete parallel alignment of gene sequences in Hadoop platform without using other tools. This model not only simplifies the process flow of gene sequence alignment, but also improves the performance compared with other methods. This paper describes in detail the method of manipulating gene sequences with multiple inputs and outputs modes on Hadoop platform and the design of a computing model based on this method, and proves the superiority of this model through experiments.
