2021
DOI: 10.48550/arxiv.2104.11981
Preprint

DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training

Abstract: The scale of deep learning nowadays calls for efficient distributed training algorithms. Decentralized momentum SGD (DmSGD), in which each node averages only with its neighbors, is more communication efficient than vanilla parallel momentum SGD, which incurs a global average across all computing nodes. On the other hand, large-batch training has been demonstrated to be critical to achieving runtime speedup. This motivates us to investigate how DmSGD performs in the large-batch scenario. In this work, we find the moment…
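For intuition about the update the abstract describes, here is a minimal numpy sketch of decentralized momentum SGD iterations on a toy quadratic problem, assuming a ring topology with a doubly-stochastic mixing matrix; the problem setup, constants, and variable names are illustrative assumptions, not the paper's algorithm or code.

import numpy as np

# Toy setup: n nodes, each holding a local quadratic loss f_i(x) = 0.5*||x - b_i||^2,
# so the stochastic gradient at node i is (x_i - b_i) plus noise.
n, d = 4, 10
rng = np.random.default_rng(0)
b = rng.normal(size=(n, d))                  # local data (one target per node)

# Ring-topology mixing matrix W (doubly stochastic): each node averages
# only with its two neighbors instead of with all n nodes.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 1 / 3
    W[i, (i - 1) % n] = 1 / 3
    W[i, (i + 1) % n] = 1 / 3

x = np.zeros((n, d))                         # per-node model copies
m = np.zeros((n, d))                         # per-node momentum buffers
lr, beta = 0.1, 0.9

for step in range(200):
    grad = (x - b) + 0.01 * rng.normal(size=(n, d))   # local stochastic gradients
    m = beta * m + grad                               # local momentum update
    x_local = x - lr * m                              # local SGD-with-momentum step
    x = W @ x_local                                   # neighbor averaging (decentralized)
    # Vanilla parallel momentum SGD would instead take a global average:
    # x = np.tile(x_local.mean(axis=0), (n, 1))

print("consensus error:", np.linalg.norm(x - x.mean(axis=0)))

The commented-out line marks the global average that parallel momentum SGD would perform in place of the neighbor averaging W @ x_local, which is the communication difference the abstract highlights.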

Cited by 6 publications (11 citation statements)
References 34 publications
“…The special case h(x) = 0 of Problem (1) has been relatively well-studied. For this smooth formulation, variants of decentralized stochastic gradient descent (DSGD), e.g., [4,26,52,70], admit simple implementations yet provide competitive practical performance against centralized methods in homogeneous environments like data centers. When the data distributions across the network become heterogeneous, the performance of DSGD in both practice and theory degrades significantly [15,39,57,59,68].…”
Section: Literature Review (mentioning)
confidence: 99%
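For context, the DSGD recursion referenced above is commonly written in adapt-then-combine form as (generic notation, not copied from any one of the cited works):

x_i^{(k+1)} = \sum_{j \in \mathcal{N}_i} w_{ij} \left( x_j^{(k)} - \alpha \, \nabla F_j\bigl(x_j^{(k)}; \xi_j^{(k)}\bigr) \right),

where \mathcal{N}_i is the neighborhood of node i, w_{ij} are the mixing weights, \alpha is the step size, and \xi_j^{(k)} is a local data sample; heterogeneity means the distributions of \xi_j differ across nodes, which is the regime where DSGD's performance is reported to degrade.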
“…This cooperative minimization paradigm, built upon local communication and computation, has numerous applications in estimation, control, adaptation, and learning problems that frequently arise in multi-agent systems [8,17,31,57]. In particular, the sparse and localized peer-to-peer information exchange pattern in decentralized networks substantially reduces the communication overhead on the parameter server in the centralized networks, thus making decentralized optimization algorithms especially appealing in large-scale data analytics and machine learning tasks [4,26,70].…”
Section: Introduction (mentioning)
confidence: 99%
“…There are many variants of decentralized momentum SGD [3,20,32,67]. This paper will focus on the one proposed by [64] (listed in Algorithm 1), which imposes an additional partial averaging over the momentum to achieve further speedup.…”
Section: Decentralized Momentum SGD (DmSGD) (mentioning)
confidence: 99%
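The partial averaging over the momentum described above can be rendered, very roughly, as the following numpy sketch, where the mixing matrix W is applied to the momentum buffer as well as to the parameters; this is an illustrative rendering of the idea, not a transcription of Algorithm 1 in [64].

import numpy as np

def dmsgd_step(x, m, grad, W, lr=0.1, beta=0.9):
    # x, m, grad: (n_nodes, dim) arrays of per-node parameters, momentum, gradients.
    # W: (n_nodes, n_nodes) doubly-stochastic mixing matrix over the topology.
    m_new = W @ (beta * m + grad)   # additional partial averaging over the momentum
    x_new = W @ (x - lr * m_new)    # partial averaging over the parameters
    return x_new, m_new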
“…In the deep learning regime, decentralized SGD, which was established in [30] to achieve the same linear speedup as parallel SGD in convergence rate, has attracted a lot of attention. Many efforts have been made to extend the algorithm to directed topologies [3,42], time-varying topologies [25,42], asynchronous settings [31], and data-heterogeneous scenarios [57,62,32,67]. Techniques such as quantization/compression [2,8,26,24,58,36], periodic updates [55,25,64], and lazy communication [37,38,13] were also integrated into decentralized SGD to further reduce communication overheads.…”
Section: Related Work (mentioning)
confidence: 99%