2021
DOI: 10.48550/arxiv.2103.00543
Preprint

On the Utility of Gradient Compression in Distributed Training Systems

Saurabh Agarwal,
Hongyi Wang,
Shivaram Venkataraman
et al.

Abstract: Rapid growth in data sets and the scale of neural network architectures have rendered distributed training a necessity. A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training. To alleviate these bottlenecks, the machine learning community has largely focused on developing gradient and model compression methods. In parallel, the systems community has adopted several High Performance Computing (HPC) techniques to speed up distributed training. I…

Cited by 2 publications (3 citation statements)
References: 38 publications
“…For example, many distributed systems use all-reduce style techniques to aggregate gradients or local model updates across compute nodes [198]. In such systems, compression techniques that are not compatible with all-reduce may provide less communication efficiency, despite their higher compression ratio [5,253,258]. In such settings, it is important to make sure that the compression operation commutes with addition.…”
Section: Employ Compression Techniques
Mentioning confidence: 99%
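The commutativity requirement in the quote above can be made concrete with a small sketch. The NumPy example below is purely illustrative and not code from the cited paper; top-k sparsification is used only as a hypothetical compressor that does not commute with addition, while a shared random mask is a simple linear compressor that does and therefore composes with all-reduce.

```python
# Illustrative sketch (not from the paper): why all-reduce-compatible compression
# must commute with addition.
import numpy as np

rng = np.random.default_rng(0)

def topk(g, k):
    """Keep the k largest-magnitude entries, zero the rest (hypothetical compressor)."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

# Two workers' gradients
g1, g2 = rng.normal(size=8), rng.normal(size=8)
k = 2

# All-reduce sums compressed gradients; for a non-commuting compressor,
# sum(compress(g_i)) generally differs from compress(sum(g_i)).
lhs = topk(g1, k) + topk(g2, k)
rhs = topk(g1 + g2, k)
print("top-k commutes with addition:", np.allclose(lhs, rhs))  # usually False

# A shared-mask sparsifier is a linear map, so it commutes and composes with all-reduce.
mask = rng.random(8) < 0.5
def compress(g):
    return g * mask
print("masked sparsification commutes:",
      np.allclose(compress(g1) + compress(g2), compress(g1 + g2)))  # True
```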
“…In particular, if we force all clients to use a τ-fold larger mini-batch size in synchronous SGD, then this algorithm can also save τ times the communication, which is the same as local-update algorithms. We refer to this algorithm as large-batch synchronous SGD. It has been well understood (see [66]) that the worst-case error of large-batch synchronous SGD is:…”
Section: Savings in Communication Rounds
Mentioning confidence: 99%
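To make the τ-fold communication saving mentioned in this quote concrete, here is a hypothetical back-of-the-envelope sketch; the sample budget, baseline batch size, and τ below are assumed values, not figures from the cited work.

```python
# Hypothetical arithmetic (assumed numbers): processing a fixed budget of N gradient
# samples per worker with a tau-fold larger mini-batch cuts the number of synchronous
# communication rounds by a factor of tau.
N = 1_000_000          # per-worker sample budget (assumed)
base_batch = 64        # baseline mini-batch size (assumed)
tau = 8                # batch-size multiplier (assumed)

rounds_baseline = N // base_batch             # one all-reduce per mini-batch
rounds_large_batch = N // (tau * base_batch)  # tau-fold fewer all-reduces

print(rounds_baseline, rounds_large_batch)    # 15625 1953
```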
“…An efficient solution should significantly reduce the total number of bits used in a given transmission without impacting rapid convergence to a good estimate of θ. In order to provide an end-to-end speedup, a list of properties that compression methods should satisfy is proposed in [27]. Quantization is a popular data compression approach which aims to approximate some quantity using a smaller number of bits to simplify processing, storage and analysis.…”
Section: A. Related Work
Mentioning confidence: 99%
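As a concrete illustration of the quantization idea in the last quote, the sketch below uniformly quantizes a float32 gradient vector to 8-bit codes; the function names and parameter choices are assumptions for illustration, not the scheme proposed in [27] or evaluated in the paper.

```python
# Generic illustration (an assumption, not the cited scheme): uniform 8-bit
# quantization of a float32 gradient, cutting transmitted bits 4x at the cost
# of a bounded approximation error.
import numpy as np

rng = np.random.default_rng(1)
grad = rng.normal(size=1024).astype(np.float32)

def quantize_uint8(g):
    """Map g linearly onto 256 levels; return codes plus the offset and scale."""
    lo, hi = float(g.min()), float(g.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    codes = np.round((g - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

codes, lo, scale = quantize_uint8(grad)
approx = dequantize(codes, lo, scale)

print("bits per entry: 32 -> 8 (plus a small per-tensor header)")
print("max abs error:", float(np.max(np.abs(grad - approx))))  # roughly <= scale / 2
```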