Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020
DOI: 10.1145/3373376.3378499

Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training

Abstract: Distributed deep learning training usually adopts All-Reduce as the synchronization mechanism for data-parallel algorithms due to its high performance in homogeneous environments. However, its performance is bounded by the slowest worker among all workers, and is significantly slower in heterogeneous situations. AD-PSGD, a newly proposed synchronization method which provides numerically fast convergence and heterogeneity tolerance, suffers from deadlock issues and high synchronization overhead. Is it possible t…
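As background for the mechanisms the abstract contrasts, the following is a minimal single-process sketch of an AD-PSGD-style update on a toy least-squares problem: at each step a randomly chosen worker averages its model with a random peer and then takes a local gradient step. This is an illustration only; the worker count, learning rate, and quadratic objective are assumptions, not details from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy objective: each worker minimizes ||x - target_i||^2 on its own data shard (assumed).
n_workers, dim, steps, lr = 8, 4, 2000, 0.05
targets = rng.normal(size=(n_workers, dim))   # per-worker data (illustrative)
models = rng.normal(size=(n_workers, dim))    # per-worker model copies

for _ in range(steps):
    i = rng.integers(n_workers)               # an "active" worker wakes up
    j = rng.integers(n_workers)               # randomly chosen gossip peer
    if j == i:
        continue
    # Gossip averaging: both workers move to the midpoint of their models.
    avg = 0.5 * (models[i] + models[j])
    models[i] = models[j] = avg
    # Local SGD step on worker i's shard (gradient of ||x - t_i||^2 is 2(x - t_i)).
    models[i] -= lr * 2.0 * (models[i] - targets[i])

consensus = models.mean(axis=0)
print("spread across workers:", np.linalg.norm(models - consensus))
print("distance to global optimum:", np.linalg.norm(consensus - targets.mean(axis=0)))

Because only one random pair communicates per step, a slow worker delays only the updates it participates in, rather than every iteration as a global All-Reduce barrier would.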

Cited by 63 publications (55 citation statements) · References 34 publications
“…Some advocate less strict parameter synchronization to mitigate the synchronization cost, such as Stale Synchronous Parallel (SSP) [28], Dynamic Synchronous Parallel [48] and Round-Robin Synchronous Parallel [15]. AD-PSGD [29] and Hop [30] are variations of asynchronous and stale synchronous training, which target communication efficiency in heterogeneous environments. Some other studies focus on heterogeneity-aware distributed SGD algorithms, employing a constant learning rate for delayed gradient push in the SSP protocol, to reduce disturbance and unstable convergence caused by stragglers [25].…”
Section: Stragglers in Distributed Model Training
confidence: 99%
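For reference, the Stale Synchronous Parallel idea cited above can be summarized in a few lines: every worker advances its own iteration clock, and a fast worker blocks only when it gets too far ahead of the slowest one. The sketch below is a scheduling illustration with assumed per-worker speeds, not any library's API.

import random

random.seed(1)
n_workers, staleness_bound, total_iters = 4, 3, 20
clock = [0] * n_workers                  # iterations completed per worker
speed = [1.0, 1.0, 1.0, 0.3]             # worker 3 is a straggler (assumed speeds)

while min(clock) < total_iters:
    for w in range(n_workers):
        # Faster workers get more chances to advance per round.
        if random.random() < speed[w]:
            # SSP rule: worker w may start its next iteration only if it is fewer
            # than `staleness_bound` iterations ahead of the slowest worker.
            if clock[w] - min(clock) < staleness_bound:
                clock[w] += 1            # otherwise w blocks this round
print("final clocks:", clock)

Fast workers drift at most staleness_bound iterations ahead, which bounds parameter staleness while avoiding a full barrier at every iteration.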
“…AllReduce methods [12,33,34] also lack the flexibility to tackle straggler issues, which is more challenging than in the PS architecture due to the more restrictive communication pattern between workers. There are some works focusing on straggler issues in AllReduce [29,30], but their methods may lead to deadlocks and may not be able to deal with complex straggler patterns, such as transient stragglers.…”
Section: Stragglers in Distributed Model Training
confidence: 99%
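To make the straggler sensitivity concrete: a synchronous all-reduce finishes an iteration only when the slowest participant arrives, so per-iteration time is the maximum over workers. A back-of-the-envelope comparison, with assumed compute times and an assumed transient straggler, is below.

import numpy as np

rng = np.random.default_rng(42)
n_workers, iters = 16, 1000

# Assumed per-iteration compute times: most workers ~100 ms, one transient straggler.
base = rng.normal(loc=0.100, scale=0.005, size=(iters, n_workers))
base[:, 0] += rng.choice([0.0, 0.200], size=iters, p=[0.9, 0.1])  # +200 ms, 10% of the time

allreduce_time = base.max(axis=1).sum()      # global barrier: wait for the slowest worker
mean_worker_time = base.mean()
print(f"all-reduce wall time:   {allreduce_time:.1f} s over {iters} iterations")
print(f"average worker compute: {mean_worker_time * iters:.1f} s of useful work per worker")

Even though only one worker is occasionally slow, every iteration in which it lags pays the full penalty, which is the behavior the cited works try to avoid.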
“…Even though the slow workers inevitably have staler parameters, the effects on the training of the global model can be minimized via the probabilistic solution. However, this strategy requires additional system overhead and manual efforts to generate a dynamic communication graph [39].…”
Section: Algorithm 1 The Decentralized Training Process
confidence: 99%
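One simple way to picture a dynamic communication graph is to draw a fresh random matching of workers into disjoint pairs at every step, so no worker is claimed by two concurrent updates and no wait-for cycle can form. The helper below is only an illustration of that idea, not the mechanism used in [39]; the function name and worker counts are assumptions.

import random

def random_matching(n_workers, seed=None):
    """Return disjoint (i, j) partner pairs covering the workers for one step.

    Because the pairs are disjoint, no worker is asked to average with two
    peers at once, which avoids the wait-for cycles that cause deadlock.
    """
    rng = random.Random(seed)
    workers = list(range(n_workers))
    rng.shuffle(workers)
    pairs = [(workers[k], workers[k + 1]) for k in range(0, n_workers - 1, 2)]
    return pairs   # with an odd worker count, the last worker sits this step out

print(random_matching(8, seed=0))
print(random_matching(7, seed=0))   # one idle worker when the count is odd

Regenerating the matching every step is where the extra scheduling overhead mentioned in the quoted statement comes from.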
“…Several pieces of research have explored the robustness of deep learning processes [37] [39]. In particular, AD-PSGD [37] is the first that proposes to use randomized communication to reduce the effects of stragglers probabilistically.…”
Section: Introduction
confidence: 99%
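A rough way to see the probabilistic argument: with uniformly random peer selection among n workers, a given gossip update involves one particular straggler either as the active worker (probability 1/n) or as the chosen peer (probability (n-1)/n · 1/(n-1)), i.e. with probability about 2/n, so only that fraction of updates is exposed to it. A small worked calculation:

# Fraction of gossip updates that touch one specific straggler under uniform
# random pairing: 1/n (active) + (n-1)/n * 1/(n-1) (chosen as peer) = 2/n.
for n in (8, 32, 128):
    p = 2.0 / n
    print(f"n={n:4d}: ~{p:.1%} of updates are exposed to a single straggler")

As the cluster grows, the straggler's influence on overall progress shrinks, which is the probabilistic mitigation the quoted statement refers to.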