Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2020
DOI: 10.1145/3332466.3374528
Taming unbalanced training workloads in deep learning with partial collective operations

Abstract: Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use t…
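As a rough illustration of the relaxed-synchronization idea described in the abstract (not the paper's actual partial collective primitives or API), the toy sketch below averages fresh gradients only from the fastest workers and substitutes a cached, stale contribution for stragglers; the function name, quorum rule, and simulated latencies are all hypothetical.

```python
# Illustrative sketch only: simulates quorum-based partial gradient
# accumulation, where stragglers' fresh gradients are skipped and a stale
# cached contribution is used instead. Not eager-SGD's actual primitives.
import numpy as np

def partial_accumulate(fresh_grads, arrival_times, stale_grads, quorum):
    """Average gradients of the `quorum` fastest workers; for the rest,
    fall back to their previously cached (stale) gradients."""
    order = np.argsort(arrival_times)              # fastest workers first
    fast = set(order[:quorum].tolist())
    contrib = [fresh_grads[w] if w in fast else stale_grads[w]
               for w in range(len(fresh_grads))]
    return np.mean(contrib, axis=0)

rng = np.random.default_rng(0)
P, n = 4, 8                                        # workers, gradient size
fresh = [rng.normal(size=n) for _ in range(P)]
stale = [np.zeros(n) for _ in range(P)]            # e.g., zeros on the first step
times = rng.exponential(scale=1.0, size=P)         # simulated per-worker latencies
avg_grad = partial_accumulate(fresh, times, stale, quorum=P // 2 + 1)
print(avg_grad.shape)  # (8,)
```

A real system would perform this accumulation inside the communication layer rather than on gathered arrays; the point here is only that slow workers need not block the gradient update.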

Cited by 47 publications (28 citation statements)
References 45 publications
“…Those will experience variability in both execution time and memory consumption. It has also been recently observed in machine learning framework on GPUs [32].…”
Section: Related Work
confidence: 79%
“…To scale up the training process to parallel machines, data parallelism [18,25,26,38,52,53] is the common method, in which the mini-batch is partitioned among 𝑃 workers and each worker maintains a copy of the entire model. Gradient accumulation across 𝑃 workers is often implemented using a standard dense allreduce [12], leading to about 2𝑛 communication volume where 𝑛 is the number of gradient components (equal to the number of model parameters).…”
Section: Background and Related Work
confidence: 99%
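The ≈2𝑛 per-worker volume quoted above matches the standard ring-allreduce cost model (reduce-scatter followed by allgather), which sends 2(𝑃−1)/𝑃 · 𝑛 elements per worker and approaches 2𝑛 as 𝑃 grows. A small back-of-envelope check, with an illustrative parameter count:

```python
# Back-of-envelope check of the ~2n per-worker volume of a ring allreduce
# (reduce-scatter + allgather). The parameter count is illustrative.
def ring_allreduce_volume(n, P):
    """Elements sent per worker: 2 * (P - 1) / P * n, which tends to 2n."""
    return 2 * (P - 1) / P * n

n = 25_000_000                      # e.g., a model with ~25M parameters
for P in (4, 16, 64):
    print(P, ring_allreduce_volume(n, P) / n)   # -> 1.5, 1.875, ~1.97
```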
“…We compare the performance of RNA with three other synchronization models: Horovod [49], AD-PSGD [37], and eager-SGD [35]. Horovod is selected as the state-of-the-art baseline, which significantly outperforms many other implementations of All-Reduce.…”
Section: Approaches and Performance Metrics
confidence: 99%
“…Prague [39] and Eager-SGD [35] are more related to our approach, which proposes a new communication primitive to allow partial workers to synchronize parameters quickly. Specifically, Prague offers both static and dynamic group scheduling to construct a new group randomly during the runtime to avoid conflicts.…”
Section: Related Work
confidence: 99%
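As a loose illustration of the partial-synchronization idea mentioned in this last statement, the sketch below cuts workers into random disjoint groups so that each group could run its own small allreduce without waiting for the others; the grouping rule is a toy assumption, not Prague's or eager-SGD's actual scheduling algorithm.

```python
# Toy sketch of randomized, disjoint synchronization groups. Not Prague's
# (or eager-SGD's) actual group-scheduling algorithm.
import random

def random_groups(workers, group_size, seed=None):
    """Shuffle workers and cut them into disjoint groups; each group can
    then synchronize internally without blocking on the others."""
    rng = random.Random(seed)
    order = list(workers)
    rng.shuffle(order)
    return [order[i:i + group_size] for i in range(0, len(order), group_size)]

print(random_groups(range(8), group_size=4, seed=0))
# e.g. [[...4 workers...], [...4 workers...]] -- two disjoint groups
```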