Comparative Study Load Balance Algorithms for Map Reduce Environment

Hefny, Hesham A.; Khafagy, Mohamed Helmy; Wahdan, Ahmed M.

doi:10.5120/ijais14-451261

Cited by 7 publications

(2 citation statements)

References 22 publications

“…Partitioning for Map-Reduce Partitioning has been widely studied for Map-Reduce-based processing [7,22,23,25]. While conceptually similar, these approaches either require offline preprocessing of the data and, thus, are not suitable with optimize solely for the map or the reduce phase.…”

Section: Related Work (mentioning)

confidence: 99%

Dalton

2022

To sustain the input rate of high-throughput streams, modern stream processing systems rely on parallel execution. However, skewed data yield imbalanced load assignments and create stragglers that hinder scalability. Deciding on a static partitioning for a given set of "hot" keys is not sufficient as these keys are not known in advance, and even worse, the data distribution can change unpredictably. Existing algorithms either optimize for a specific distribution or, in order to adapt, assume a centralized partitioner that processes every incoming tuple and observes the whole workload. However, this is not realistic in a distributed environment, where multiple parallel upstream operators exist, as the centralized partitioner itself becomes the bottleneck and limits scalability. In this work, we propose Dalton: a lightweight, adaptive, yet scalable partitioning operator that relies on reinforcement learning. By memoizing state and dynamically keeping track of recent experience, Dalton: i) adjusts its policy at runtime and quickly adapts to the workload, ii) avoids redundant computations and minimizes the per-tuple partitioning overhead, and iii) efficiently scales out to multiple instances that learn cooperatively and converge to a joint policy. Our experiments indicate that Dalton scales regardless of the input data distribution and sustains 1.3X - 6.7X higher throughput than existing approaches.

“…The next MapReduce job reads the intermediate results of the previous job to continue processing. The HDFS I/O cost is significantly higher than local storage (i.e., there is Network cost) that use load balance [22,23]. So, exploiting shared jobs can reduce intermediate results, and can be cheaper than generating too large size of intermediate results in the case of using an original data source for each query separately [24,25].…”

Section: MapReduce Query Processing (mentioning)

confidence: 99%