2017
DOI: 10.1109/tac.2017.2671377

Data-Dependent Convergence for Consensus Stochastic Optimization

Abstract: We study a distributed consensus-based stochastic gradient descent (SGD) algorithm and show that the rate of convergence involves the spectral properties of two matrices: the standard spectral gap of a weight matrix from the network topology and a new term depending on the spectral norm of the sample covariance matrix of the data. This data-dependent convergence rate shows that distributed SGD algorithms perform better on datasets with small spectral norm. Our analysis method also allows us to find data-depend…
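The two spectral quantities the abstract names are straightforward to compute. Below is a minimal numpy sketch, assuming a doubly stochastic weight matrix built from a ring topology and a generic data matrix X; both choices are illustrative and not taken from the paper.

```python
import numpy as np

def ring_weight_matrix(m: int) -> np.ndarray:
    """Doubly stochastic mixing weights for an m-node ring: 1/2 self, 1/4 per neighbor."""
    W = 0.5 * np.eye(m)
    for i in range(m):
        W[i, (i - 1) % m] += 0.25
        W[i, (i + 1) % m] += 0.25
    return W

def spectral_gap(W: np.ndarray) -> float:
    """Standard spectral gap 1 - |lambda_2(W)| of the mixing matrix."""
    mags = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    return float(1.0 - mags[1])

def covariance_spectral_norm(X: np.ndarray) -> float:
    """Spectral norm (largest eigenvalue) of the sample covariance of X (n samples x d features)."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]
    return float(np.linalg.norm(cov, 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))               # illustrative data, n=1000, d=20
print("spectral gap:", spectral_gap(ring_weight_matrix(8)))
print("covariance spectral norm:", covariance_spectral_norm(X))
```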

Cited by 13 publications (5 citation statements)
References 17 publications
“…Each node only synchronizes with its neighbors, thus reducing the communication overhead significantly. Decentralized averaging has a long history in the distributed and consensus optimization community (Tsitsiklis et al., 1986; Nedic & Ozdaglar, 2009; Duchi et al., 2012; Tsianos et al., 2012; Zeng & Yin, 2016; Yuan et al., 2016; Sirb & Ye, 2018; Bijral et al., 2017). Most of these works are for gradient descent or dual averaging methods rather than stochastic gradient descent (SGD), and they do not allow workers to make local updates.…”
Section: Introduction (mentioning)
confidence: 99%
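For concreteness, here is a minimal sketch of the kind of consensus-based SGD iteration this statement refers to: each node mixes its neighbors' iterates through a doubly stochastic matrix W and then takes a local stochastic gradient step. The least-squares objective, ring topology, step size, and iteration count are illustrative assumptions, not the setup of any particular cited paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n_local, eta = 8, 10, 200, 0.01        # nodes, dimension, samples per node, step size

# Doubly stochastic mixing matrix for a ring of m nodes.
W = 0.5 * np.eye(m)
for i in range(m):
    W[i, (i - 1) % m] += 0.25
    W[i, (i + 1) % m] += 0.25

# Local least-squares data with a common planted parameter vector of ones.
A = [rng.normal(size=(n_local, d)) for _ in range(m)]
b = [A[i] @ np.ones(d) + 0.1 * rng.normal(size=n_local) for i in range(m)]

X = np.zeros((m, d))                          # row i is node i's current iterate
for t in range(500):
    X = W @ X                                 # consensus step: mix with neighbors' iterates
    for i in range(m):                        # local stochastic gradient step
        k = rng.integers(n_local)
        g = (A[i][k] @ X[i] - b[i][k]) * A[i][k]
        X[i] = X[i] - eta * g

print("distance to planted model:", np.linalg.norm(X.mean(axis=0) - np.ones(d)))
```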
“…Many papers (e.g., [41], [42]) use the term in the SA sense described here, with a continuous stream of data in which no sample is used more than once. However, other papers (e.g., [2], [43]) use the term within the ERM framework to describe algorithms that operate on a fixed dataset, from which mini-batches of data are sampled with replacement and noisy gradients are computed. To disambiguate, some authors (e.g., [44]) use the term single-pass SGD to indicate the former usage.…”
Section: Stochastic Approximation (SA) (mentioning)
confidence: 99%
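A minimal sketch of the two usages being distinguished here, on an assumed synthetic least-squares problem: single-pass SGD streams through the data once and never reuses a sample, while ERM-style SGD draws indices with replacement from a fixed dataset, so samples may repeat. All parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, eta = 5000, 10, 0.01
A = rng.normal(size=(n, d))
y = A @ np.ones(d) + 0.1 * rng.normal(size=n)

def grad(x, i):
    """Stochastic gradient of the least-squares loss at sample i."""
    return (A[i] @ x - y[i]) * A[i]

# Single-pass (stochastic approximation) SGD: each sample is used exactly once.
x_sa = np.zeros(d)
for i in range(n):
    x_sa -= eta * grad(x_sa, i)

# ERM-style SGD on a fixed dataset: indices sampled with replacement.
x_erm = np.zeros(d)
for _ in range(n):
    x_erm -= eta * grad(x_erm, rng.integers(n))

print(np.linalg.norm(x_sa - 1.0), np.linalg.norm(x_erm - 1.0))
```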
“…Notably, consensus dynamics provide a foundation for decentralized optimization algorithms, which strategically implement consensus using synchronous or asynchronous gossiping between nodes [26], [27]. Such algorithms are often employed to take advantage of distributed computing infrastructure to more efficiently train machine learning models, such as support vector machines [28], [29] and deep neural networks [30]–[33]. For such systems, each node trains a local model on local data, and at the same time, communication between nodes enables them to reach a consensus on what the model parameter should be.…”
Section: Introduction (mentioning)
confidence: 99%
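A minimal sketch of the pairwise gossip mechanism this statement alludes to, assuming a ring of nodes each holding a scalar value: repeatedly averaging random neighbor pairs preserves the network-wide mean and drives every node toward it, which is the consensus primitive decentralized training builds on.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 8
x = rng.normal(size=m)                        # each node's local value (e.g. a model parameter)
neighbors = {i: [(i - 1) % m, (i + 1) % m] for i in range(m)}   # ring topology
target = x.mean()                             # the consensus value nodes should agree on

for _ in range(2000):
    i = int(rng.integers(m))                  # a node wakes up at random
    j = int(rng.choice(neighbors[i]))         # and gossips with one of its neighbors
    x[i] = x[j] = 0.5 * (x[i] + x[j])         # pairwise averaging preserves the global mean

print("spread across nodes:", x.std(), "mean preserved:", np.isclose(x.mean(), target))
```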