Proceedings of the Seventh ACM Symposium on Cloud Computing 2016
DOI: 10.1145/2987550.2987554
Addressing the straggler problem for iterative convergent parallel ML

Cited by 113 publications (56 citation statements)
References 24 publications
“…When scaling the training of deep learning models in distributed clusters, a parameter server (PS) [35] design is the de-facto approach. In contrast to CROSSBOW, which improves the training performance with small batch sizes on a single multi-GPU server, PS-based systems address the challenges of using a cluster for distributed learning, including the handling of elastic and heterogeneous resources [22,27], the mitigation of stragglers [13,11,16], the acceleration of synchronisation using hybrid hardware [12], and the avoidance of resource fragmentation using collective communication [63,56,24]. Similar to prior model averaging systems [69], CROSSBOW could adopt a PS design to manage its average model in a distributed deployment.…”
Section: Related Work
confidence: 99%
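The excerpt above describes the parameter-server (PS) pattern only in general terms. The following minimal Python sketch illustrates the basic pull/push interaction between workers and a server under a bulk-synchronous scheme; the class and function names, the toy least-squares workload, and the in-process setup are illustrative assumptions, not the API of any of the cited systems.

```python
import numpy as np

class ParameterServer:
    """Toy in-process parameter server: holds the global model and
    applies gradients pushed by workers (illustrative only)."""

    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr

    def pull(self):
        # Workers fetch the latest global parameters.
        return self.weights.copy()

    def push(self, grads):
        # Average gradients from all workers and take one SGD step.
        self.weights -= self.lr * np.mean(grads, axis=0)

def worker_gradient(weights, x, y):
    """Least-squares gradient computed on one worker's data shard."""
    return 2 * x.T @ (x @ weights - y) / len(y)

# Bulk-synchronous loop: every worker pulls, computes on its shard,
# and the server waits for all pushes before updating the model.
rng = np.random.default_rng(0)
shards = [(rng.normal(size=(64, 4)), rng.normal(size=64)) for _ in range(4)]
ps = ParameterServer(dim=4)
for step in range(20):
    w = ps.pull()
    grads = [worker_gradient(w, x, y) for x, y in shards]
    ps.push(grads)
```

In a real deployment the server and workers run in separate processes or machines and communicate over RPC; the synchronous barrier shown here is exactly the point where stragglers hurt, which motivates the mitigation work cited in the excerpt.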
“…A previous approach [11] waits for a worker to become idle before looking for work to steal, incurring additional delay until work is found. The recent effort FlexRR [14] identifies slow workers and reassigns their load before fast workers finish. However, these stealing approaches estimate the stolen task load based on a linear assumption.…”
Section: Related Work
confidence: 99%
“…Most efforts [11], [12], [13], [14] mitigate skewness based on the assumption that task processing time is linearly dependent on the size of the partitioned data. While the assumption may hold true for the map stage, it does not work well for the reduce stage.…”
Section: Introduction
confidence: 99%
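The linear assumption criticized in this excerpt is easy to state concretely: estimate a task's processing time as its partition size divided by a fixed throughput. The toy sketch below (hypothetical numbers and function names, not taken from any cited system) shows such an estimate and why it can under-predict a reduce-side straggler whose cost grows super-linearly under key skew.

```python
def estimate_task_time(partition_size, rate):
    """Linear model: processing time ~ partition_size / throughput rate."""
    return partition_size / rate

RATE = 500_000  # records per second (toy throughput)

# Map-like task: cost is roughly proportional to input size,
# so the linear estimate is reasonable.
print(estimate_task_time(2_000_000, RATE))   # 4.0 seconds

# Reduce-like task with a skewed key: suppose the true cost grows
# quadratically in the number of records for that key (toy model).
def true_reduce_time(records):
    return (records / 1_000_000) ** 2         # seconds, toy super-linear cost

print(estimate_task_time(4_000_000, RATE))    # linear estimate: 8.0 s
print(true_reduce_time(4_000_000))            # toy "actual" cost: 16.0 s
```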
“…In typical load balancing approaches, there are two basic steps: first to detect stragglers, and second to tackle stragglers by load adjustment. For example, FlexRR [38], a recently proposed load-balancing approach for Machine Learning workloads, detects stragglers by frequently measuring and synchronizing the worker progress at runtime; once stragglers are detected, it immediately offloads some of their training work to other lightly-loaded workers.…”
Section: Load-balancing DL Workloads
confidence: 99%
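The excerpt summarizes FlexRR's two steps: detect stragglers by comparing worker progress at runtime, then shift part of their remaining work to faster workers. Below is a minimal sketch of that idea; the slack threshold, the offload fraction, and every name are illustrative assumptions rather than FlexRR's actual parameters or interfaces.

```python
def rebalance(progress, work_left, slack=0.2, offload_frac=0.25):
    """Toy straggler mitigation: if a worker lags the fastest worker's
    progress by more than `slack`, move a fraction of its remaining
    work items to the currently least-loaded worker.

    progress:  dict worker -> fraction of the iteration completed (0..1)
    work_left: dict worker -> list of remaining work items
    """
    fastest = max(progress.values())
    for worker, done in progress.items():
        if fastest - done > slack:                      # straggler detected
            helper = min(work_left, key=lambda w: len(work_left[w]))
            if helper == worker:
                continue
            n = int(len(work_left[worker]) * offload_frac)
            moved, work_left[worker] = (work_left[worker][:n],
                                        work_left[worker][n:])
            work_left[helper].extend(moved)
            print(f"moved {len(moved)} items from {worker} to {helper}")
    return work_left

# Example: worker "w2" is behind, so part of its queue is reassigned.
progress = {"w0": 0.9, "w1": 0.85, "w2": 0.5}
queues   = {"w0": list(range(2)), "w1": list(range(3)), "w2": list(range(12))}
rebalance(progress, queues)
```

The sketch performs one rebalancing pass; a system like the one described would run such checks periodically during each iteration, which is why very short iterations (as noted in the next excerpt) leave little room for detection and reassignment to pay off.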
“…That said, existing load balancing approaches [14,17,33,38] are not suitable for deep learning training. They are designed for cases where each iteration is sufficiently long so that stragglers can be detected and tackled, but iterations in deep learning training are quite short in general, lasting for a few seconds or even less than a second.…”
Section: Introduction
confidence: 99%