2020
DOI: 10.48550/arxiv.2003.06307
Preprint

Communication-Efficient Distributed Deep Learning: A Comprehensive Survey

Abstract: Distributed deep learning has become common for reducing overall training time by exploiting multiple computing devices (e.g., GPUs/TPUs) as the sizes of deep models and data sets increase. However, data communication between computing devices can be a bottleneck that limits system scalability. How to address the communication problem in distributed deep learning has recently become a hot research topic. In this paper, we provide a comprehensive survey of the communication-efficient distribute…

Cited by 25 publications (23 citation statements)
References 120 publications
“…We now show the convergence of the loss function. In the following we show that the converged parameter in (40) guarantees the convergence of our loss function. We denote the r.h.s. of (40) with a fixed step size η as…”
Section: A2 Proof of Theorem
confidence: 76%
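The quoted proof fragment stops at a definition taken from equation (40), which is not reproduced in this excerpt. As a hedged sketch only, the usual way such arguments tie a converged parameter sequence to convergence of the loss is through the descent lemma for an L-smooth objective f with a fixed step size η (the smoothness assumption and the plain gradient update below are assumptions of this sketch, not the cited paper's actual statement):

f(w_{t+1}) \le f(w_t) - \eta\Bigl(1 - \tfrac{L\eta}{2}\Bigr)\|\nabla f(w_t)\|^2, \qquad w_{t+1} = w_t - \eta\,\nabla f(w_t), \quad 0 < \eta \le \tfrac{1}{L}.

Summing this inequality over t = 0, \dots, T-1 and telescoping gives \min_{t<T}\|\nabla f(w_t)\|^2 \le 2\bigl(f(w_0) - f^\ast\bigr)/(\eta T), so boundedness of the loss from below forces the gradient norms, and hence the loss decrease, to converge.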
“…Assumption 3 is widely used in convergence results of gradient methods, e.g., [25,41,43,44]. Assumption 4 is also a standard assumption [40,45]. We use the bound σ_{jL} to quantify the heterogeneity of the non-i.i.d.…”
Section: Assumptions
confidence: 99%
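The excerpt names its assumptions only by number, so the exact statements are not visible here. As an assumption-labelled sketch, the standard forms such conditions take in distributed/federated convergence analyses are smoothness, bounded stochastic-gradient variance, and a per-worker heterogeneity bound in the spirit of the excerpt's σ_{jL}:

\|\nabla F_j(x) - \nabla F_j(y)\| \le L\,\|x - y\| \quad \text{(L-smoothness of each local objective } F_j\text{)},
\mathbb{E}\,\|\nabla f_j(x;\xi) - \nabla F_j(x)\|^2 \le \sigma^2 \quad \text{(bounded gradient noise)},
\|\nabla F_j(x) - \nabla F(x)\|^2 \le \sigma_{jL}^2 \quad \text{(bounded non-i.i.d. heterogeneity of worker } j\text{)},

where F = \tfrac{1}{n}\sum_j F_j is the global objective. These particular forms are illustrative, not quoted from the citing paper.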
“…In [60], the authors study several aspects of distributed learning and provide a comprehensive survey of both theoretical and practical aspects of distributed machine learning. Perhaps closest to our work is [73], where the authors perform a detailed study of whether the network is the bottleneck in distributed training.…”
Section: Related Work
confidence: 99%
“…The goal is to maintain the achieved performance while reducing communication or computation overheads. To improve communication efficiency [1], [2], coding or model compression schemes (e.g., model pruning, model quantization) can be applied to the model parameter transmissions in step (i) and the model update transmissions in step (iii). However, these schemes act only after local model training and cannot improve computation efficiency, and they add computation overhead of their own by performing the additional compression.…”
Section: B Distributed Deep Training Strategies
confidence: 99%
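To make the compression schemes mentioned in this excerpt concrete, below is a minimal NumPy sketch of two common techniques for shrinking a model-update transmission: top-k sparsification followed by uniform 8-bit quantization of the surviving values. The function names, the 1% sparsity ratio, and the plain NumPy setting are illustrative assumptions, not an API from the survey or the citing paper.

import numpy as np

def topk_sparsify(update, ratio=0.01):
    # Keep only the largest-magnitude entries of the flattened update (illustrative ratio).
    flat = update.ravel()
    k = max(1, int(ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the top-k magnitudes
    return idx.astype(np.int32), flat[idx]         # transmit indices + values only

def quantize_uint8(values):
    # Uniform 8-bit quantization; the scale and offset are sent so the receiver can decode.
    lo, hi = float(values.min()), float(values.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    codes = np.round((values - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize_uint8(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

# Sender: compress a (fake) one-million-parameter update before transmission.
update = np.random.randn(1_000_000).astype(np.float32)
idx, vals = topk_sparsify(update, ratio=0.01)
codes, lo, scale = quantize_uint8(vals)

# Receiver: rebuild a sparse approximation of the update from what was sent.
recovered = np.zeros_like(update)
recovered[idx] = dequantize_uint8(codes, lo, scale)

With a 1% ratio and 8-bit codes, the payload drops from 4 MB of float32 values to roughly 10,000 int32 indices plus 10,000 uint8 codes, which is the kind of bandwidth saving the excerpt's step (i)/(iii) transmissions are targeting; as the excerpt notes, the sender still pays extra compute for selecting and encoding the values.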