Proceedings of the 11th ACM Symposium on Cloud Computing 2020
DOI: 10.1145/3419111.3421299
Semi-dynamic load balancing

Abstract: Machine learning (ML) models are increasingly trained in clusters with non-dedicated workers possessing heterogeneous resources. In such scenarios, model training efficiency can be negatively affected by stragglers: workers that run much slower than others. Efficient model training requires eliminating such stragglers, yet for modern ML workloads, existing load balancing strategies are inefficient and even infeasible. In this paper, we propose a novel strategy called semi-dynamic load balancing to eliminate stragglers…
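
The truncated abstract does not spell out the mechanism, so the following is only a back-of-the-envelope illustration of the straggler problem it describes: under synchronous training an iteration lasts as long as the slowest worker, so a single straggler dominates unless batch shares are re-balanced. The worker names, speeds, and the proportional rule below are assumptions for illustration, not the paper's actual algorithm.

```python
# Hypothetical illustration (not from the paper): iteration time under an equal
# split versus a speed-proportional split of the global batch across workers.

speeds = {"w0": 100.0, "w1": 95.0, "w2": 30.0}   # samples/second, assumed
global_batch = 512

# Equal split: every worker gets the same batch regardless of its speed.
equal = {w: global_batch / len(speeds) for w in speeds}
t_equal = max(equal[w] / s for w, s in speeds.items())

# Proportional split: batch sizes follow relative speed, evening out finish times.
total = sum(speeds.values())
prop = {w: global_batch * s / total for w, s in speeds.items()}
t_prop = max(prop[w] / s for w, s in speeds.items())

print(f"iteration time, equal split:        {t_equal:.2f} s")
print(f"iteration time, proportional split: {t_prop:.2f} s")
```

With these assumed speeds, the straggler "w2" stretches the equal-split iteration to roughly 5.7 s, while the proportional split finishes in about 2.3 s.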

Cited by 20 publications (2 citation statements) · References 47 publications
“…This general-purpose data partitioning framework performs accurate and efficient benchmarking to obtain the relative speed of the resources that constitute the cluster, providing the load measurements for each element that optimize execution time. Other criteria could also be taken into account to determine these measurements [5]. In this particular case, the resource speed is used to define the heterogeneity of the platform, as explained in the following Sect.…”
Section: Hetgrad Optimization Methodology
confidence: 99%
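
As a rough illustration of the benchmarking step described in this statement, one could time a fixed probe workload on every resource and convert the measurements into relative-speed partition weights. This is a sketch under assumptions; `benchmark`, `partition_weights`, and the callable-per-resource interface are hypothetical, not the cited framework's API.

```python
import time

def benchmark(workers, probe):
    """Return the wall-clock time of a fixed probe workload on every resource.

    `workers` maps a resource name to a callable that runs the probe on that
    resource (an assumed interface for this sketch).
    """
    times = {}
    for name, run_on_resource in workers.items():
        start = time.perf_counter()
        run_on_resource(probe)
        times[name] = time.perf_counter() - start
    return times

def partition_weights(times):
    """Relative speed is the inverse of the measured time, normalized to sum to 1."""
    speeds = {name: 1.0 / t for name, t in times.items()}
    total = sum(speeds.values())
    return {name: s / total for name, s in speeds.items()}
```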
“…Regarding workload distribution in the data-parallelism scheme, a dynamic workload distribution scheme is proposed in [5] to adapt the assigned batch size to each replica in every iteration. A recurrent neural network (RNN) is used to measure the speed of each replica.…”
confidence: 99%
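
The statement above describes per-iteration batch-size adaptation driven by a learned speed estimate. Below is a minimal sketch of that control loop under assumptions: an exponential moving average stands in for the RNN speed predictor of [5], and the class and function names are illustrative only.

```python
# Hypothetical sketch of per-iteration batch-size adaptation (not the method of [5]);
# an exponential moving average replaces the RNN speed predictor for simplicity.

class SpeedPredictor:
    """Predict each replica's speed (samples/second) from its observed history."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha
        self.estimate: dict[str, float] = {}

    def update(self, observed: dict[str, float]) -> dict[str, float]:
        for replica, speed in observed.items():
            prev = self.estimate.get(replica, speed)
            self.estimate[replica] = self.alpha * speed + (1 - self.alpha) * prev
        return dict(self.estimate)

def next_batch_sizes(global_batch: int, predicted: dict[str, float]) -> dict[str, int]:
    """Assign each replica a batch size proportional to its predicted speed."""
    total = sum(predicted.values())
    return {r: max(1, round(global_batch * s / total)) for r, s in predicted.items()}

# Example: the straggling replica "w2" receives a correspondingly smaller batch.
predictor = SpeedPredictor()
predicted = predictor.update({"w0": 100.0, "w1": 95.0, "w2": 30.0})
print(next_batch_sizes(512, predicted))
```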