2019
DOI: 10.48550/arxiv.1911.08772
Preprint

Understanding Top-k Sparsification in Distributed Deep Learning

Abstract: Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck. Recently proposed gradient sparsification techniques, especially Top-k sparsification with error compensation (TopK-SGD), can significantly reduce the communication traffic without obvious impact on the model accuracy. Some theoretical studies have been carried out to analyze the convergence property of TopK-…

Cited by 26 publications (47 citation statements)
References 20 publications (28 reference statements)
“…Gradient sparsification [2,4,11,16,19,36,[41][42][43]50] is a key approach to lower the communication volume. By top-π‘˜ selection, i.e., only selecting the largest (in terms of the absolute value) π‘˜ of 𝑛 components, the gradient becomes very sparse (commonly around 99%).…”
Section: Background and Related Workmentioning
confidence: 99%
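The top-k selection described in this citation statement can be sketched in a few lines. The following is a minimal illustration, assuming PyTorch; the function name `topk_sparsify` and the 1% compression ratio are illustrative, not taken from the cited papers.

```python
# Minimal sketch of top-k gradient sparsification (assumes PyTorch;
# the function name and the 1% ratio are illustrative, not from the paper).
import torch

def topk_sparsify(grad: torch.Tensor, k: int):
    """Keep only the k largest-magnitude components of a gradient tensor.

    Only (values, indices) need to be communicated; the remaining
    components are treated as zero, giving roughly (1 - k/n) sparsity.
    """
    flat = grad.flatten()
    # Indices of the k components with the largest absolute value.
    _, idx = torch.topk(flat.abs(), k)
    return flat[idx], idx

# Example: keep 1% of a 10,000-element gradient (~99% sparsity).
g = torch.randn(100, 100)
vals, idx = topk_sparsify(g, k=100)
```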
“…Then, the accumulated sparse gradient is used in the Stochastic Gradient Descent (SGD) optimizer to update the model parameters, which is called Topπ‘˜ SGD. The convergence of Topπ‘˜ SGD has been theoretically and empirically proved [4,36,41]. However, the parallel scalablity of the existing sparse allreduce algorithms is limited, which makes it very difficult to obtain real performance improvement, especially on the machines (e.g., supercomputers) with high-performance interconnected networks [5,17,37,40].…”
Section: Background and Related Workmentioning
confidence: 99%
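The error-compensation step mentioned in this statement (accumulating the dropped components locally and adding them back before the next top-k selection) can be sketched as follows. This is a minimal single-worker sketch assuming PyTorch; the class and variable names are illustrative, and the exchange of the sparse (values, indices) pair between workers via sparse allreduce is omitted.

```python
# Minimal sketch of error-compensated Top-k SGD on one worker (assumes
# PyTorch; names are illustrative, and the inter-worker exchange of the
# sparse (values, indices) pair is omitted).
import torch

class TopKCompressor:
    def __init__(self, k: int):
        self.k = k
        self.residual = None  # locally accumulated error (dropped components)

    def compress(self, grad: torch.Tensor):
        flat = grad.flatten()
        if self.residual is None:
            self.residual = torch.zeros_like(flat)
        acc = flat + self.residual           # error compensation
        _, idx = torch.topk(acc.abs(), self.k)
        vals = acc[idx]
        self.residual = acc.clone()          # remember what was not selected
        self.residual[idx] = 0
        return vals, idx

    def decompress(self, vals, idx, shape):
        dense = torch.zeros(shape, device=vals.device).flatten()
        dense[idx] = vals
        return dense.view(shape)

# One SGD iteration using the sparsified, error-compensated gradient.
comp = TopKCompressor(k=100)
param = torch.randn(100, 100, requires_grad=True)
loss = (param ** 2).sum()
loss.backward()
vals, idx = comp.compress(param.grad)
sparse_grad = comp.decompress(vals, idx, param.grad.shape)
with torch.no_grad():
    param -= 0.01 * sparse_grad
```

In a distributed setting, the (values, indices) pairs from all workers would be combined with a sparse allreduce before the parameter update; the limited scalability of that step is the bottleneck the citation statement above refers to.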