Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020
DOI: 10.1145/3373376.3378499

Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training

Abstract: Distributed deep learning training usually adopts All-Reduce as the synchronization mechanism for data-parallel algorithms due to its high performance in homogeneous environments. However, its performance is bounded by the slowest worker among all workers, and is significantly slower in heterogeneous situations. AD-PSGD, a newly proposed synchronization method which provides numerically fast convergence and heterogeneity tolerance, suffers from deadlock issues and high synchronization overhead. Is it possible t…
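As background for the mechanisms the abstract contrasts, the following is a minimal single-process sketch of an AD-PSGD-style update on a toy least-squares problem: at each step a randomly chosen worker averages its model with a random peer and then takes a local gradient step. This is an illustration only; the worker count, learning rate, and quadratic objective are assumptions, not details from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy objective: each worker minimizes ||x - target_i||^2 on its own data shard (assumed).
n_workers, dim, steps, lr = 8, 4, 2000, 0.05
targets = rng.normal(size=(n_workers, dim))   # per-worker data (illustrative)
models = rng.normal(size=(n_workers, dim))    # per-worker model copies

for _ in range(steps):
    i = rng.integers(n_workers)               # an "active" worker wakes up
    j = rng.integers(n_workers)               # randomly chosen gossip peer
    if j == i:
        continue
    # Gossip averaging: both workers move to the midpoint of their models.
    avg = 0.5 * (models[i] + models[j])
    models[i] = models[j] = avg
    # Local SGD step on worker i's shard (gradient of ||x - t_i||^2 is 2(x - t_i)).
    models[i] -= lr * 2.0 * (models[i] - targets[i])

consensus = models.mean(axis=0)
print("spread across workers:", np.linalg.norm(models - consensus))
print("distance to global optimum:", np.linalg.norm(consensus - targets.mean(axis=0)))

Because only one random pair communicates per step, a slow worker delays only the updates it participates in, rather than every iteration as a global All-Reduce barrier would.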

Cited by 63 publications (55 citation statements) · References 34 publications
“…Some advocate less strict parameter synchronization to mitigate the synchronization cost, such as Stale Synchronous Parallel (SSP) [28], Dynamic Synchronous Parallel [48] and Round-Robin Synchronous Parallel [15]. AD-PSGD [29] and Hop [30] are variations of asynchronous and stale synchronous training, which target communication efficiency in heterogeneous environments. Some other studies focus on heterogeneity-aware distributed SGD algorithms, employing a constant learning rate for delayed gradient push in the SSP protocol, to reduce disturbance and unstable convergence caused by stragglers [25].…”
Section: Stragglers in Distributed Model Training
confidence: 99%
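For reference, the Stale Synchronous Parallel idea cited above can be summarized in a few lines: every worker advances its own iteration clock, and a fast worker blocks only when it gets too far ahead of the slowest one. The sketch below is a scheduling illustration with assumed per-worker speeds, not any library's API.

import random

random.seed(1)
n_workers, staleness_bound, total_iters = 4, 3, 20
clock = [0] * n_workers                  # iterations completed per worker
speed = [1.0, 1.0, 1.0, 0.3]             # worker 3 is a straggler (assumed speeds)

while min(clock) < total_iters:
    for w in range(n_workers):
        # Faster workers get more chances to advance per round.
        if random.random() < speed[w]:
            # SSP rule: worker w may start its next iteration only if it is fewer
            # than `staleness_bound` iterations ahead of the slowest worker.
            if clock[w] - min(clock) < staleness_bound:
                clock[w] += 1            # otherwise w blocks this round
print("final clocks:", clock)

Fast workers drift at most staleness_bound iterations ahead, which bounds parameter staleness while avoiding a full barrier at every iteration.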
“…AllReduce methods [12,33,34] also lack the flexibility to tackle straggler issues, which is more challenging than in the PS architecture due to the more restrictive communication pattern between workers. There are some works focusing on straggler issues in AllReduce [29,30], but their methods may lead to deadlocks and may not be able to deal with complex straggler patterns, such as transient stragglers.…”
Section: Stragglers in Distributed Model Training
confidence: 99%
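To make the straggler sensitivity concrete: a synchronous all-reduce finishes an iteration only when the slowest participant arrives, so per-iteration time is the maximum over workers. A back-of-the-envelope comparison, with assumed compute times and an assumed transient straggler, is below.

import numpy as np

rng = np.random.default_rng(42)
n_workers, iters = 16, 1000

# Assumed per-iteration compute times: most workers ~100 ms, one transient straggler.
base = rng.normal(loc=0.100, scale=0.005, size=(iters, n_workers))
base[:, 0] += rng.choice([0.0, 0.200], size=iters, p=[0.9, 0.1])  # +200 ms, 10% of the time

allreduce_time = base.max(axis=1).sum()      # global barrier: wait for the slowest worker
mean_worker_time = base.mean()
print(f"all-reduce wall time:   {allreduce_time:.1f} s over {iters} iterations")
print(f"average worker compute: {mean_worker_time * iters:.1f} s of useful work per worker")

Even though only one worker is occasionally slow, every iteration in which it lags pays the full penalty, which is the behavior the cited works try to avoid.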
“…Even though the slow workers inevitably have staler parameters, the effects on the training of the global model can be minimized via the probabilistic solution. However, this strategy requires additional system overhead and manual efforts to generate a dynamic communication graph [39].…”
Section: Algorithm 1 The Decentralized Training Process
confidence: 99%
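One simple way to picture a dynamic communication graph is to draw a fresh random matching of workers into disjoint pairs at every step, so no worker is claimed by two concurrent updates and no wait-for cycle can form. The helper below is only an illustration of that idea, not the mechanism used in [39]; the function name and worker counts are assumptions.

import random

def random_matching(n_workers, seed=None):
    """Return disjoint (i, j) partner pairs covering the workers for one step.

    Because the pairs are disjoint, no worker is asked to average with two
    peers at once, which avoids the wait-for cycles that cause deadlock.
    """
    rng = random.Random(seed)
    workers = list(range(n_workers))
    rng.shuffle(workers)
    pairs = [(workers[k], workers[k + 1]) for k in range(0, n_workers - 1, 2)]
    return pairs   # with an odd worker count, the last worker sits this step out

print(random_matching(8, seed=0))
print(random_matching(7, seed=0))   # one idle worker when the count is odd

Regenerating the matching every step is where the extra scheduling overhead mentioned in the quoted statement comes from.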
“…Several pieces of research have explored the robustness of deep learning processes [37] [39]. In particular, AD-PSGD [37] is the first that proposes to use randomized communication to reduce the effects of stragglers probabilistically.…”
Section: Introduction
confidence: 99%
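A rough way to see the probabilistic argument: with uniformly random peer selection among n workers, a given gossip update involves one particular straggler either as the active worker (probability 1/n) or as the chosen peer (probability (n-1)/n · 1/(n-1)), i.e. with probability about 2/n, so only that fraction of updates is exposed to it. A small worked calculation:

# Fraction of gossip updates that touch one specific straggler under uniform
# random pairing: 1/n (active) + (n-1)/n * 1/(n-1) (chosen as peer) = 2/n.
for n in (8, 32, 128):
    p = 2.0 / n
    print(f"n={n:4d}: ~{p:.1%} of updates are exposed to a single straggler")

As the cluster grows, the straggler's influence on overall progress shrinks, which is the probabilistic mitigation the quoted statement refers to.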