2019
DOI: 10.1609/aaai.v33i01.33015693
Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning

Abstract: In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up training with multiple workers. It uses multiple workers to sample local stochastic gradients in parallel, aggregates all gradients on a single server to obtain their average, and updates each worker's local model with an SGD step using the averaged gradient. Ideally, parallel mini-batch SGD can achieve a linear speed-up of the training time (with respect to the number of workers) compared with SGD o…
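For intuition, the synchronous scheme the abstract describes can be sketched as follows: each worker draws a local stochastic gradient, a server averages them, and every worker applies the same averaged SGD step. This is a minimal illustrative sketch; the function names (parallel_minibatch_sgd_step, sample_gradient) and the toy quadratic objective are assumptions, not code from the paper.

```python
import numpy as np

def parallel_minibatch_sgd_step(w, sample_gradient, num_workers, lr):
    """One synchronous round: average the workers' stochastic gradients,
    then apply a single SGD step with that average on every worker."""
    local_grads = [sample_gradient(w) for _ in range(num_workers)]  # computed in parallel in practice
    avg_grad = np.mean(local_grads, axis=0)                          # server-side aggregation
    return w - lr * avg_grad                                         # identical update for all workers

# Toy usage: minimize ||w||^2 with noisy gradients.
rng = np.random.default_rng(0)
grad = lambda w: 2 * w + rng.normal(scale=0.1, size=w.shape)
w = np.ones(5)
for _ in range(100):
    w = parallel_minibatch_sgd_step(w, grad, num_workers=8, lr=0.05)
```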

Cited by 425 publications (298 citation statements)
References 10 publications
“…We emphasize that unlike [YYZ18,Sti19], which only consider local computation, we combine quantization and sparsification with local computation, which poses several technical challenges; e.g., see proofs of Lemma 4, 5, 6.…”
Section: Results
Mentioning confidence: 99%
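The combination of gradient compression with local computation that this statement refers to can be illustrated with a small Top-k sparsifier: only the retained (index, value) pairs would need to be communicated. This is a generic sketch under my own naming (top_k_sparsify), not the cited papers' implementation.

```python
import numpy as np

def top_k_sparsify(grad, k):
    """Zero out all but the k largest-magnitude entries of a gradient.
    In a distributed setting only the surviving (index, value) pairs
    would be sent to the server, reducing communication."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]  # indices of the k largest magnitudes
    out = np.zeros_like(grad)
    out[idx] = grad[idx]
    return out

# Example: keep the 2 largest-magnitude coordinates of a 5-dimensional gradient.
g = np.array([0.1, -2.0, 0.3, 1.5, -0.05])
print(top_k_sparsify(g, k=2))  # -> [ 0.  -2.   0.   1.5  0. ]
```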
“…[WHHZ18] analyzed error compensation for QSGD, without Top k sparsification while focusing on quadratic functions. Another approach for mitigating the communication bottlenecks is by having infrequent communication, which has been popularly referred to in the literature as iterative parameter mixing, see [Cop15], and model averaging, see [Sti19,YYZ18,ZSMR16] and references therein. Our work is most closely related to and builds upon the recent theoretical results in [AHJ + 18, SCJ18,Sti19,YYZ18].…”
Section: Related Work
Mentioning confidence: 99%
“…Multiple local updates before aggregation is possible in the bound derived in [26], but the number of local updates varies based on the thresholding procedure and cannot be specified as a given constant. Concurrently with our work, bounds with a fixed number of local updates between global aggregation steps are derived in [32], [33]. However, the bound in [32] only works with i.i.d.…”
Section: Related Work
Mentioning confidence: 99%
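The "fixed number of local updates between global aggregation steps" pattern discussed above (local SGD with periodic model averaging) can be sketched as below. The loop structure, names (local_sgd, sample_gradient), and toy objective are illustrative assumptions, not code from [32] or [33].

```python
import numpy as np

def local_sgd(w0, sample_gradient, num_workers, local_steps, rounds, lr):
    """Each worker runs `local_steps` SGD updates on its own model copy,
    then all copies are averaged (model averaging) and broadcast back,
    so there is only one communication round per `local_steps` updates."""
    workers = [w0.copy() for _ in range(num_workers)]
    for _ in range(rounds):
        for i in range(num_workers):               # these loops run in parallel in practice
            for _ in range(local_steps):
                workers[i] = workers[i] - lr * sample_gradient(workers[i])
        avg = np.mean(workers, axis=0)             # single aggregation per round
        workers = [avg.copy() for _ in range(num_workers)]
    return workers[0]

# Toy usage: noisy gradients of ||w||^2.
rng = np.random.default_rng(1)
grad = lambda w: 2 * w + rng.normal(scale=0.1, size=w.shape)
w = local_sgd(np.ones(5), grad, num_workers=4, local_steps=10, rounds=20, lr=0.05)
```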
“…By varying the batch size [124,312,373], this method is effective in reducing the communication cost without too much accuracy loss. In the next paragraph, we will discuss more about the parallel SGD algorithms [226,316,375,383,399] for improving the communication efficiency, which can be seen as one way of improving the performance of data parallelism. Another type of data parallel that addresses the memory limit on single GPU is spatial parallelism [167].…”
Section: Distributed Machine Learning
Mentioning confidence: 99%