2018
DOI: 10.48550/arxiv.1808.07576
Preprint
Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms

Jianyu Wang,
Gauri Joshi

Abstract: Communication-efficient SGD algorithms, which allow nodes to perform local updates and periodically synchronize local models, are highly effective in improving the speed and scalability of distributed SGD. However, a rigorous convergence analysis and comparative study of different communication-reduction strategies remains a largely open problem. This paper presents a unified framework called Cooperative SGD that subsumes existing communication-efficient SGD algorithms such as periodic-averaging, elastic-averag…
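The core pattern the abstract describes, workers taking several local SGD steps and then periodically mixing their models, can be illustrated with a minimal sketch. This is only an illustration under assumed names: the function local_update_sgd, the parameter tau for the number of local steps, and the mixing matrix W (defaulting to full averaging) are not from the paper, and the sketch omits the auxiliary variables used by elastic-averaging variants.

```python
import numpy as np

def local_update_sgd(grad_fn, x0, num_workers=4, tau=10, rounds=50, lr=0.05,
                     mixing_matrix=None, seed=0):
    """Run tau local SGD steps on each worker, then mix the local models.

    grad_fn(x, worker_id, rng) must return a stochastic gradient of the
    shared objective at the local model x. All names here are illustrative.
    """
    rng = np.random.default_rng(seed)
    # Full averaging recovers periodic-averaging SGD; a sparser doubly
    # stochastic matrix would correspond to decentralized mixing.
    W = (np.full((num_workers, num_workers), 1.0 / num_workers)
         if mixing_matrix is None else np.asarray(mixing_matrix, dtype=float))
    models = np.tile(np.asarray(x0, dtype=float), (num_workers, 1))

    for _ in range(rounds):
        for _ in range(tau):                 # tau local steps, no communication
            for i in range(num_workers):
                models[i] -= lr * grad_fn(models[i], i, rng)
        models = W @ models                  # communication: mix local models
    return models.mean(axis=0)

# Toy usage: each worker sees noisy gradients of the same quadratic objective.
if __name__ == "__main__":
    grad = lambda x, i, rng: 2.0 * (x - 3.0) + 0.1 * rng.standard_normal(x.shape)
    print(local_update_sgd(grad, x0=np.zeros(5)))  # should approach [3, 3, 3, 3, 3]
```

Setting tau to 1 with full averaging reduces the sketch to fully synchronous SGD, which is the sense in which a single framework can cover several communication-reduction strategies.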

Cited by 103 publications (200 citation statements)
References 28 publications
“…In this section, we provide some auxiliary results for the proof of Theorem 1. We first give an alternative form of the reconstruction error derived from the condition (7) and the performance guarantee (6). Lemma 3.…”
Section: A. Auxiliary Results (mentioning)
confidence: 99%
“…The first category aims to reduce the number of communication rounds, based on the idea that each edge device runs multiple local SGD steps in parallel before sending the local updates to the server for aggregation. This approach has also been called FedAvg [1] in federated learning, and its convergence has been studied in [5,6,7]. Another line of work investigates lazy/adaptive upload of information, i.e., local gradients are uploaded only when found to be informative enough [8].…”
Section: Introduction (mentioning)
confidence: 99%
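The statement above describes the FedAvg pattern of running multiple local SGD steps on each client before server-side aggregation. The following sketch shows one such round under assumed names (fedavg_round, local_steps) and a least-squares loss chosen purely for concreteness; it is not the implementation from [1] or from this paper.

```python
import numpy as np

def fedavg_round(global_model, client_data, local_steps=5, lr=0.1):
    """One FedAvg-style round: local SGD on each client, then weighted averaging.

    client_data is a list of (X, y) pairs; names and defaults are illustrative.
    """
    local_models, sample_counts = [], []
    for X, y in client_data:
        w = global_model.copy()
        for _ in range(local_steps):
            grad = 2.0 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
            w -= lr * grad
        local_models.append(w)
        sample_counts.append(len(y))
    # Server aggregation: average client models weighted by local sample counts.
    return np.average(np.stack(local_models), axis=0, weights=sample_counts)

# Toy usage: three clients whose data share the same underlying linear model.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = np.array([1.0, -2.0, 0.5])
    clients = []
    for n in (20, 40, 60):
        X = rng.standard_normal((n, 3))
        clients.append((X, X @ w_true + 0.01 * rng.standard_normal(n)))
    w = np.zeros(3)
    for _ in range(30):
        w = fedavg_round(w, clients)
    print(w)  # should be close to w_true
```

Weighting the average by local sample counts mirrors the standard FedAvg aggregation rule; the lazy/adaptive upload schemes mentioned above would instead add a test deciding whether a client's update is informative enough to send at all.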
“…FedAvg is able to reduce communication costs by training clients for multiple rounds locally. Several works have shown the convergence of FedAvg under different settings, with both homogeneous (IID) data [37,41] and heterogeneous (non-IID) data [23,3,44], even with partial client participation. Specifically, [44] demonstrated that LocalSGD achieves O(1/√(NQ)) convergence for non-convex optimization, and [23] established a convergence rate of O(1/Q) for strongly convex problems on FedAvg, where Q is the number of local SGD steps and N is the number of participating clients.…”
Section: Related Work (mentioning)
confidence: 99%
“…However, the success of these algorithms has only been demonstrated empirically (e.g., [6,13]). Unlike standard FL, which has received rigorous theoretical analysis [37,3,44,23], the convergence of heterogeneous FL with adaptive online model pruning is still an open question. Little is known about whether such algorithms converge to a solution of standard FL.…”
Section: Introduction (mentioning)
confidence: 99%
“…In particular, the iteration complexity and convergence of FedAve are carefully analyzed in [20]. More generally, a unified analysis of the class of communication-efficient SGD algorithms is presented in [23]. Various other federated optimization methods have also been proposed that address different drawbacks of FedAve.…”
Section: Communication Cost (mentioning)
confidence: 99%