2021
DOI: 10.48550/arxiv.2104.11981
Preprint

DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training

Abstract: The scale of deep learning nowadays calls for efficient distributed training algorithms. Decentralized momentum SGD (DmSGD), in which each node averages only with its neighbors, is more communication efficient than vanilla parallel momentum SGD, which incurs a global average across all computing nodes. On the other hand, large-batch training has been demonstrated to be critical to achieving runtime speedup. This motivates us to investigate how DmSGD performs in the large-batch scenario. In this work, we find the moment…
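For intuition about the update the abstract describes, here is a minimal numpy sketch of decentralized momentum SGD iterations on a toy quadratic problem, assuming a ring topology with a doubly-stochastic mixing matrix; the problem setup, constants, and variable names are illustrative assumptions, not the paper's algorithm or code.

import numpy as np

# Toy setup: n nodes, each holding a local quadratic loss f_i(x) = 0.5*||x - b_i||^2,
# so the stochastic gradient at node i is (x_i - b_i) plus noise.
n, d = 4, 10
rng = np.random.default_rng(0)
b = rng.normal(size=(n, d))                  # local data (one target per node)

# Ring-topology mixing matrix W (doubly stochastic): each node averages
# only with its two neighbors instead of with all n nodes.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 1 / 3
    W[i, (i - 1) % n] = 1 / 3
    W[i, (i + 1) % n] = 1 / 3

x = np.zeros((n, d))                         # per-node model copies
m = np.zeros((n, d))                         # per-node momentum buffers
lr, beta = 0.1, 0.9

for step in range(200):
    grad = (x - b) + 0.01 * rng.normal(size=(n, d))   # local stochastic gradients
    m = beta * m + grad                               # local momentum update
    x_local = x - lr * m                              # local SGD-with-momentum step
    x = W @ x_local                                   # neighbor averaging (decentralized)
    # Vanilla parallel momentum SGD would instead take a global average:
    # x = np.tile(x_local.mean(axis=0), (n, 1))

print("consensus error:", np.linalg.norm(x - x.mean(axis=0)))

The commented-out line marks the global average that parallel momentum SGD would perform in place of the neighbor averaging W @ x_local, which is the communication difference the abstract highlights.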

Cited by 6 publications (11 citation statements)
References 34 publications
“…The special case h(x) = 0 of Problem (1) has been relatively well-studied. For this smooth formulation, variants of decentralized stochastic gradient descent (DSGD), e.g., [4,26,52,70], admit simple implementations yet provide competitive practical performance against centralized methods in homogeneous environments like data centers. When the data distributions across the network become heterogeneous, the performance of DSGD in both practice and theory degrades significantly [15,39,57,59,68].…”
Section: Literature Review (mentioning)
confidence: 99%
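For context, the DSGD recursion referenced above is commonly written in adapt-then-combine form as (generic notation, not copied from any one of the cited works):

x_i^{(k+1)} = \sum_{j \in \mathcal{N}_i} w_{ij} \left( x_j^{(k)} - \alpha \, \nabla F_j\bigl(x_j^{(k)}; \xi_j^{(k)}\bigr) \right),

where \mathcal{N}_i is the neighborhood of node i, w_{ij} are the mixing weights, \alpha is the step size, and \xi_j^{(k)} is a local data sample; heterogeneity means the distributions of \xi_j differ across nodes, which is the regime where DSGD's performance is reported to degrade.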
“…This cooperative minimization paradigm, built upon local communication and computation, has numerous applications in estimation, control, adaptation, and learning problems that frequently arise in multi-agent systems [8,17,31,57]. In particular, the sparse and localized peer-to-peer information exchange pattern in decentralized networks substantially reduces the communication overhead on the parameter server in the centralized networks, thus making decentralized optimization algorithms especially appealing in large-scale data analytics and machine learning tasks [4,26,70].…”
Section: Introduction (mentioning)
confidence: 99%
“…There are many variants of decentralized momentum SGD [3,20,32,67]. This paper will focus on the one proposed by [64] (listed in Algorithm 1), which imposes an additional partial averaging over the momentum to achieve further speedup.…”
Section: Decentralized Momentum SGD (DmSGD) (mentioning)
confidence: 99%
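The partial averaging over the momentum described above can be rendered, very roughly, as the following numpy sketch, where the mixing matrix W is applied to the momentum buffer as well as to the parameters; this is an illustrative rendering of the idea, not a transcription of Algorithm 1 in [64].

import numpy as np

def dmsgd_step(x, m, grad, W, lr=0.1, beta=0.9):
    # x, m, grad: (n_nodes, dim) arrays of per-node parameters, momentum, gradients.
    # W: (n_nodes, n_nodes) doubly-stochastic mixing matrix over the topology.
    m_new = W @ (beta * m + grad)   # additional partial averaging over the momentum
    x_new = W @ (x - lr * m_new)    # partial averaging over the parameters
    return x_new, m_new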
“…In the deep learning regime, decentralized SGD, which was established in [30] to achieve the same linear speedup as parallel SGD in convergence rate, has attracted a lot of attention. Many efforts have been made to extend the algorithm to directed topologies [3,42], time-varying topologies [25,42], asynchronous settings [31], and data-heterogeneous scenarios [57,62,32,67]. Techniques such as quantization/compression [2,8,26,24,58,36], periodic updates [55,25,64], and lazy communication [37,38,13] were also integrated into decentralized SGD to further reduce communication overheads.…”
Section: Related Work (mentioning)
confidence: 99%