2020
DOI: 10.48550/arxiv.2003.06307
Preprint

Communication-Efficient Distributed Deep Learning: A Comprehensive Survey

Abstract: Distributed deep learning has become common for reducing overall training time by exploiting multiple computing devices (e.g., GPUs/TPUs) as the sizes of deep models and data sets increase. However, data communication between computing devices can be a bottleneck that limits system scalability. How to address the communication problem in distributed deep learning has recently become a hot research topic. In this paper, we provide a comprehensive survey of the communication-efficient distribute…

Cited by 25 publications (23 citation statements)
References 120 publications
“…We now show the convergence of the loss function. In the following we show that the converged parameter in (40) guarantees the convergence of our loss function. We denote the r.h.s. of (40) with a fixed step size η as…”
Section: A2 Proof of Theorem
confidence: 76%
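The quoted proof fragment stops at a definition taken from equation (40), which is not reproduced in this excerpt. As a hedged sketch only, the usual way such arguments tie a converged parameter sequence to convergence of the loss is through the descent lemma for an L-smooth objective f with a fixed step size η (the smoothness assumption and the plain gradient update below are assumptions of this sketch, not the cited paper's actual statement):

f(w_{t+1}) \le f(w_t) - \eta\Bigl(1 - \tfrac{L\eta}{2}\Bigr)\|\nabla f(w_t)\|^2, \qquad w_{t+1} = w_t - \eta\,\nabla f(w_t), \quad 0 < \eta \le \tfrac{1}{L}.

Summing this inequality over t = 0, \dots, T-1 and telescoping gives \min_{t<T}\|\nabla f(w_t)\|^2 \le 2\bigl(f(w_0) - f^\ast\bigr)/(\eta T), so boundedness of the loss from below forces the gradient norms, and hence the loss decrease, to converge.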
“…Assumption 3 is widely used in convergence results of gradient methods, e.g., [25,41,43,44]. Assumption 4 is also a standard assumption [40,45]. We use the bound σ_{jL} to quantify the heterogeneity of the non-i.i.d.…”
Section: Assumptions
confidence: 99%
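The excerpt names its assumptions only by number, so the exact statements are not visible here. As an assumption-labelled sketch, the standard forms such conditions take in distributed/federated convergence analyses are smoothness, bounded stochastic-gradient variance, and a per-worker heterogeneity bound in the spirit of the excerpt's σ_{jL}:

\|\nabla F_j(x) - \nabla F_j(y)\| \le L\,\|x - y\| \quad \text{(L-smoothness of each local objective } F_j\text{)},
\mathbb{E}\,\|\nabla f_j(x;\xi) - \nabla F_j(x)\|^2 \le \sigma^2 \quad \text{(bounded gradient noise)},
\|\nabla F_j(x) - \nabla F(x)\|^2 \le \sigma_{jL}^2 \quad \text{(bounded non-i.i.d. heterogeneity of worker } j\text{)},

where F = \tfrac{1}{n}\sum_j F_j is the global objective. These particular forms are illustrative, not quoted from the citing paper.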
“…In [60], the authors study several aspects of distributed learning and provide a comprehensive survey of both theoretical and practical aspects of distributed machine learning. Perhaps closest to our work is [73], where the authors perform a detailed study of whether the network is the bottleneck in distributed training.…”
Section: Related Work
confidence: 99%
“…The goal is to maintain the achieved performance while reducing communication or computation overheads. To improve communication efficiency [1], [2], coding or model compression schemes (e.g., model pruning, model quantization) can be applied to the model parameter transmissions in step (i) and the model update transmissions in step (iii). However, these schemes act only after local model training and cannot improve computation efficiency, and they add computation overhead of their own by performing the additional compression.…”
Section: B Distributed Deep Training Strategies
confidence: 99%
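To make the compression schemes mentioned in this excerpt concrete, below is a minimal NumPy sketch of two common techniques for shrinking a model-update transmission: top-k sparsification followed by uniform 8-bit quantization of the surviving values. The function names, the 1% sparsity ratio, and the plain NumPy setting are illustrative assumptions, not an API from the survey or the citing paper.

import numpy as np

def topk_sparsify(update, ratio=0.01):
    # Keep only the largest-magnitude entries of the flattened update (illustrative ratio).
    flat = update.ravel()
    k = max(1, int(ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the top-k magnitudes
    return idx.astype(np.int32), flat[idx]         # transmit indices + values only

def quantize_uint8(values):
    # Uniform 8-bit quantization; the scale and offset are sent so the receiver can decode.
    lo, hi = float(values.min()), float(values.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    codes = np.round((values - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize_uint8(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

# Sender: compress a (fake) one-million-parameter update before transmission.
update = np.random.randn(1_000_000).astype(np.float32)
idx, vals = topk_sparsify(update, ratio=0.01)
codes, lo, scale = quantize_uint8(vals)

# Receiver: rebuild a sparse approximation of the update from what was sent.
recovered = np.zeros_like(update)
recovered[idx] = dequantize_uint8(codes, lo, scale)

With a 1% ratio and 8-bit codes, the payload drops from 4 MB of float32 values to roughly 10,000 int32 indices plus 10,000 uint8 codes, which is the kind of bandwidth saving the excerpt's step (i)/(iii) transmissions are targeting; as the excerpt notes, the sender still pays extra compute for selecting and encoding the values.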