Perturbed Iterate Analysis for Asynchronous Stochastic Optimization

Mania, Horia; Pan, Xinghao; Papailiopoulos, Dimitris S.; Recht, Benjamin; Ramchandran, Kannan; Jordan, Michael I.

doi:10.48550/arxiv.1507.06970

Cited by 36 publications

(56 citation statements)

References 0 publications


“…Proof. This is the first step of the perturbed iterate analysis framework Mania et al [2015]. We follow the steps as in Stich et al [2018].…”

Section: A1 Proof of the Main Theorem (mentioning)

confidence: 99%

Communication-efficient distributed SGD with Sketching

Ivkin, Rothchild, Ullah, et al. 2019

Preprint


Large-scale distributed training of neural networks is often limited by network bandwidth, wherein the communication time overwhelms the local computation time. Motivated by the success of sketching methods in sub-linear/streaming algorithms, we introduce SKETCHED-SGD, an algorithm for carrying out distributed SGD by communicating sketches instead of full gradients. We show that SKETCHED-SGD has favorable convergence rates on several classes of functions. When considering all communication (both of gradients and of updated model weights), SKETCHED-SGD reduces the amount of communication required compared to other gradient compression methods from O(d) or O(W) to O(log d), where d is the number of model parameters and W is the number of workers participating in training. We run experiments on a transformer model, an LSTM, and a residual network, demonstrating up to a 40x reduction in total communication cost with no loss in final model performance. We also show experimentally that SKETCHED-SGD scales to at least 256 workers without increasing communication cost or degrading model performance.
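To make the idea of "communicating sketches instead of full gradients" concrete, here is a minimal toy Count Sketch in Python. It is only an illustration under stated assumptions, not the SKETCHED-SGD implementation: the class name `CountSketch`, its parameters, and the use of precomputed random tables in place of hash functions are all hypothetical choices made for this sketch. The property it demonstrates is linearity: workers that share a seed can sum their sketches, and the result equals the sketch of the summed gradient.

```python
import random
from statistics import median


class CountSketch:
    """Toy Count Sketch with `rows` x `cols` signed counters (illustrative only)."""

    def __init__(self, rows, cols, dim, seed=0):
        rng = random.Random(seed)
        self.rows, self.cols = rows, cols
        # Assumption: explicit per-coordinate bucket/sign tables stand in for
        # hash functions. A real sketch would hash indices instead of storing
        # O(dim) tables.
        self.bucket = [[rng.randrange(cols) for _ in range(dim)] for _ in range(rows)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(rows)]
        self.table = [[0.0] * cols for _ in range(rows)]

    def add(self, vec):
        """Accumulate a dense vector (e.g. a local gradient) into the sketch."""
        for r in range(self.rows):
            for i, v in enumerate(vec):
                self.table[r][self.bucket[r][i]] += self.sign[r][i] * v

    def merge(self, other):
        """Add another sketch built with the same seed and shape (linearity)."""
        for r in range(self.rows):
            for c in range(self.cols):
                self.table[r][c] += other.table[r][c]

    def estimate(self, i):
        """Median-of-rows estimate of coordinate i of the sketched vector."""
        return median(
            self.sign[r][i] * self.table[r][self.bucket[r][i]]
            for r in range(self.rows)
        )


# Two hypothetical workers sketch their local gradients with a shared seed;
# only the small tables would need to be communicated and summed.
worker_a = CountSketch(rows=3, cols=4, dim=8)
worker_b = CountSketch(rows=3, cols=4, dim=8)
worker_a.add([0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0])
worker_b.add([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
worker_a.merge(worker_b)
```

After merging, `worker_a.estimate(i)` approximates coordinate `i` of the summed gradient; accuracy depends on the table size and on how concentrated the gradient mass is in a few coordinates.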


“…All the transmitters/receivers are equipped with 4 antenna; we simulated uncorrelated fading channels, whose coefficients are Gaussian distributed with zero mean and variance $1/d_{ij}^3$ (all the channel matrices are full-column rank); and we set $R_{n_i} = \sigma^2 I$ for all $i$, and snr $p/\sigma^2 = 3$ dB. In MIMO-SR-FLEXA, we used the step-size rule (108), with ε = 1e-5; in (153) we set $\tau_i = 0$ and computed $Q_i(Q^k)$ using the closed form solution in [123]. All the algorithms reach the same average sum-rate.…”

Section: Sum-rate Maximization Over MIMO Interference Channels (mentioning)

confidence: 99%

“…Although asynchronous block-methods have a long history (see, e.g., [5,16,45,87,237]), in the past few years, the study of asynchronous parallel optimization methods has witnessed a revival of interest. Indeed, asynchronous parallelism has been applied to many state-of-the-art optimization algorithms (mainly for convex objective functions and constraints), including stochastic gradient methods [109,137,144,153,167,184,195] and ADMM-like schemes [105,110,247]. The asynchronous counterpart of BCD methods has been introduced and studied in the seminal work [146], which motivated and oriented much of subsequent research in the field, see e.g.…”

Section: II.7 Sources and Notes (mentioning)

confidence: 99%

“…In [37] a more general, and sophisticated probabilistic model describing the statistics of $(i^k; d^k)$ was introduced, and convergence of the asynchronous parallel SCA method (165)-(166) established; theoretical complexity results were also provided, showing nearly ideal linear speedup when the number of workers is not too large. The new model in [37] neither postulates the independence between $i^k$ and $d^k$ nor requires artificial changes in the algorithm to enforce it (like those recently proposed in the probabilistic models [137,153,184] used in stochastic gradient methods); it handles instead the potential dependency among variables directly, fixing thus the theoretical issues that mar most of the aforementioned papers. It also lets one analyze for the first time in a sound way several practically used and effective computing settings and new models of asynchrony.…”

Section: II.7 Sources and Notes (mentioning)

confidence: 99%


Parallel and Distributed Successive Convex Approximation Methods for Big-Data Optimization

Scutari, Sun 2018

Preprint


“…A large number of recent studies revisited the idea of low-precision training as a means to reduce communication (Seide et al, 2014; De Sa et al, 2015; Alistarh et al, 2017; Zhou et al, 2016; Wen et al, 2017; Zhang et al, 2017; De Sa et al, 2017; Bernstein et al, 2018a). Other approaches for low-communication training focus on sparsification of gradients, either by thresholding small entries or by random sampling (Strom, 2015; Mania et al, 2015; Suresh et al, 2016; Leblond et al, 2016; Aji & Heafield, 2017; Lin et al, 2017; Chen et al, 2017; Renggli et al, 2018; Tsuzuku et al, 2018; Wang et al, 2018; Vogels et al, 2019).…”

Section: Introduction (mentioning)

confidence: 99%

Pufferfish: Communication-efficient Models At No Extra Cost

Wang, Agarwal, Papailiopoulos 2021

Preprint

Self Cite


To mitigate communication overheads in distributed model training, several studies propose the use of compressed stochastic gradients, usually achieved by sparsification or quantization. Such techniques achieve high compression ratios, but in many cases incur either significant computational overheads or some accuracy loss. In this work, we present PUFFERFISH, a communication and computation efficient distributed training framework that incorporates the gradient compression into the model training process via training low-rank, pre-factorized deep networks. PUFFERFISH not only reduces communication, but also completely bypasses any computation overheads related to compression, and achieves the same accuracy as state-of-the-art, off-the-shelf deep models. PUFFERFISH can be directly integrated into current deep learning frameworks with minimum implementation modification. Our extensive experiments over real distributed setups, across a variety of large-scale machine learning tasks, indicate that PUFFERFISH achieves up to 1.64× end-to-end speedup over the latest distributed training API in PyTorch without accuracy loss. Compared to the Lottery Ticket Hypothesis models, PUFFERFISH leads to equally accurate, small-parameter models while avoiding the burden of "winning the lottery". PUFFERFISH also leads to more accurate and smaller models than SOTA structured model pruning methods.
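As a rough illustration of why a low-rank, pre-factorized layer shrinks what must be communicated, the sketch below compares parameter counts for a dense m x n weight matrix against a rank-r factorization. It assumes the standard form W ≈ U Vᵀ with U of size m x r and V of size n x r; the abstract does not spell out this form, and the function names here are hypothetical, not part of the PUFFERFISH code base.

```python
def dense_params(m, n):
    """Number of entries in a dense m x n weight matrix."""
    return m * n


def low_rank_params(m, n, r):
    """Number of entries in factors U (m x r) and V (n x r)."""
    return r * (m + n)


def factored_matvec(U, V, x):
    """Compute U (V^T x) without ever forming the m x n product U V^T."""
    r = len(V[0])
    # z = V^T x has length r
    z = [sum(V[j][k] * x[j] for j in range(len(x))) for k in range(r)]
    # y = U z has length m
    return [sum(U[i][k] * z[k] for k in range(r)) for i in range(len(U))]


# Hypothetical 512 x 512 layer factorized at rank 64
ratio = dense_params(512, 512) / low_rank_params(512, 512, 64)
```

For these illustrative sizes the factorized layer has a quarter as many parameters as the dense one, so its gradients are correspondingly smaller without a separate compression step.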
