2019
DOI: 10.1609/aaai.v33i01.33015693
Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning

Abstract: In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up training with multiple workers. It uses multiple workers to sample local stochastic gradients in parallel, aggregates all gradients on a single server to obtain their average, and updates each worker's local model with an SGD step using the averaged gradient. Ideally, parallel mini-batch SGD can achieve a linear speed-up of the training time (with respect to the number of workers) compared with SGD o…
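For intuition, the synchronous scheme the abstract describes can be sketched as follows: each worker draws a local stochastic gradient, a server averages them, and every worker applies the same averaged SGD step. This is a minimal illustrative sketch; the function names (parallel_minibatch_sgd_step, sample_gradient) and the toy quadratic objective are assumptions, not code from the paper.

```python
import numpy as np

def parallel_minibatch_sgd_step(w, sample_gradient, num_workers, lr):
    """One synchronous round: average the workers' stochastic gradients,
    then apply a single SGD step with that average on every worker."""
    local_grads = [sample_gradient(w) for _ in range(num_workers)]  # computed in parallel in practice
    avg_grad = np.mean(local_grads, axis=0)                          # server-side aggregation
    return w - lr * avg_grad                                         # identical update for all workers

# Toy usage: minimize ||w||^2 with noisy gradients.
rng = np.random.default_rng(0)
grad = lambda w: 2 * w + rng.normal(scale=0.1, size=w.shape)
w = np.ones(5)
for _ in range(100):
    w = parallel_minibatch_sgd_step(w, grad, num_workers=8, lr=0.05)
```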

Cited by 425 publications (298 citation statements)
References 10 publications
“…We emphasize that unlike [YYZ18,Sti19], which only consider local computation, we combine quantization and sparsification with local computation, which poses several technical challenges; e.g., see proofs of Lemma 4, 5, 6.…”
Section: Results
Mentioning confidence: 99%
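The combination of gradient compression with local computation that this statement refers to can be illustrated with a small Top-k sparsifier: only the retained (index, value) pairs would need to be communicated. This is a generic sketch under my own naming (top_k_sparsify), not the cited papers' implementation.

```python
import numpy as np

def top_k_sparsify(grad, k):
    """Zero out all but the k largest-magnitude entries of a gradient.
    In a distributed setting only the surviving (index, value) pairs
    would be sent to the server, reducing communication."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]  # indices of the k largest magnitudes
    out = np.zeros_like(grad)
    out[idx] = grad[idx]
    return out

# Example: keep the 2 largest-magnitude coordinates of a 5-dimensional gradient.
g = np.array([0.1, -2.0, 0.3, 1.5, -0.05])
print(top_k_sparsify(g, k=2))  # -> [ 0.  -2.   0.   1.5  0. ]
```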
“…[WHHZ18] analyzed error compensation for QSGD, without Top k sparsification while focusing on quadratic functions. Another approach for mitigating the communication bottlenecks is by having infrequent communication, which has been popularly referred to in the literature as iterative parameter mixing, see [Cop15], and model averaging, see [Sti19,YYZ18,ZSMR16] and references therein. Our work is most closely related to and builds upon the recent theoretical results in [AHJ + 18, SCJ18,Sti19,YYZ18].…”
Section: Related Work
Mentioning confidence: 99%
“…Multiple local updates before aggregation is possible in the bound derived in [26], but the number of local updates varies based on the thresholding procedure and cannot be specified as a given constant. Concurrently with our work, bounds with a fixed number of local updates between global aggregation steps are derived in [32], [33]. However, the bound in [32] only works with i.i.d.…”
Section: Related Work
Mentioning confidence: 99%
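The "fixed number of local updates between global aggregation steps" pattern discussed above (local SGD with periodic model averaging) can be sketched as below. The loop structure, names (local_sgd, sample_gradient), and toy objective are illustrative assumptions, not code from [32] or [33].

```python
import numpy as np

def local_sgd(w0, sample_gradient, num_workers, local_steps, rounds, lr):
    """Each worker runs `local_steps` SGD updates on its own model copy,
    then all copies are averaged (model averaging) and broadcast back,
    so there is only one communication round per `local_steps` updates."""
    workers = [w0.copy() for _ in range(num_workers)]
    for _ in range(rounds):
        for i in range(num_workers):               # these loops run in parallel in practice
            for _ in range(local_steps):
                workers[i] = workers[i] - lr * sample_gradient(workers[i])
        avg = np.mean(workers, axis=0)             # single aggregation per round
        workers = [avg.copy() for _ in range(num_workers)]
    return workers[0]

# Toy usage: noisy gradients of ||w||^2.
rng = np.random.default_rng(1)
grad = lambda w: 2 * w + rng.normal(scale=0.1, size=w.shape)
w = local_sgd(np.ones(5), grad, num_workers=4, local_steps=10, rounds=20, lr=0.05)
```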
“…By varying the batch size [124,312,373], this method is effective in reducing the communication cost without too much accuracy loss. In the next paragraph, we will discuss more about the parallel SGD algorithms [226,316,375,383,399] for improving the communication efficiency, which can be seen as one way of improving the performance of data parallelism. Another type of data parallel that addresses the memory limit on single GPU is spatial parallelism [167].…”
Section: Distributed Machine Learning
Mentioning confidence: 99%