2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC)
DOI: 10.1109/mlhpc.2016.004

Communication Quantization for Data-Parallel Training of Deep Neural Networks

Cited by 145 publications (128 citation statements)
References 8 publications

“…The x-axis in both plots is batches, showing that we are not relying on speed improvement to compensate for convergence. Dryden et al. (2016) used a fixed dropping ratio of 98.4% without testing other options. Switching to 99% corresponds to more than a 1.5x reduction in network bandwidth.…”
Section: Drop Ratio (mentioning, confidence: 99%)
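
For context, the bandwidth figure in that statement is plain arithmetic: the fraction of gradient values actually transmitted is one minus the drop ratio, so moving from a 98.4% to a 99% drop ratio shrinks the transmitted fraction from 1.6% to 1.0% (ignoring the index overhead of the sparse encoding).

```latex
% Ratio of transmitted volume at the two drop ratios
\frac{1 - 0.984}{1 - 0.99} = \frac{0.016}{0.010} = 1.6 > 1.5
```
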
“…We focus on data parallelism: nodes jointly optimize the same model on different parts of the training data, exchanging gradients and parameters over the network. This network communication is costly, so prior work developed two ways to approximately compress network traffic: 1-bit quantization (Seide et al., 2014) and sending sparse matrices by dropping small updates (Strom, 2015; Dryden et al., 2016). These methods were developed and tested on speech recognition and toy MNIST systems.…”
Section: Introduction (mentioning, confidence: 99%)
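
As a rough illustration of the first of those two approaches, the sketch below quantizes a gradient to one bit per value with error feedback. It is a simplification under stated assumptions, not the implementation of Seide et al. (2014), which uses per-column scaling factors; a single mean magnitude is assumed here.

```python
import numpy as np

def one_bit_quantize(grad, residual):
    """Minimal sketch of 1-bit gradient quantization with error feedback.

    Each value is reduced to its sign times a shared scale, and the
    quantization error is carried into the next step via `residual`.
    Seide et al. (2014) use per-column scales; a single mean magnitude
    is assumed here for brevity.
    """
    compensated = grad + residual             # add back last step's error
    scale = np.mean(np.abs(compensated))      # one shared magnitude
    quantized = np.sign(compensated) * scale  # effectively 1 bit per value
    new_residual = compensated - quantized    # error kept locally
    return quantized, new_residual

# Each worker keeps its own residual across iterations.
residual = np.zeros(6)
grad = np.array([0.30, -0.10, 0.05, -0.40, 0.20, -0.25])
msg, residual = one_bit_quantize(grad, residual)
```
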
“…Another notable method to reduce communication overhead in parallel mini-batch SGD is to let each worker use compressed gradients rather than raw gradients for communication. For example, quantized SGD, studied in (Seide et al. 2014; Alistarh et al. 2017; Wen et al. 2017), or sparsified SGD, studied in (Strom 2015; Dryden et al. 2016; Aji and Heafield 2017), allows each worker to pass low-bit quantized or sparsified gradients to the server at every iteration while sacrificing convergence to a mild extent. Similarly to D-PSGD, such gradient-compression-based methods require message passing at every iteration, and hence their total number of communication rounds is still the same as in parallel mini-batch SGD.…”
Section: Introduction (mentioning, confidence: 99%)
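
To make the round-counting point concrete, here is a minimal sketch, assuming a toy compressor and a simple averaging step rather than any particular library's API: compression shrinks each message, but every mini-batch still costs one communication round.

```python
import numpy as np

def quantize_1bit(g):
    """Stand-in compressor: sign times mean magnitude (1 bit per value)."""
    return np.sign(g) * np.mean(np.abs(g))

def compressed_parallel_sgd(worker_grads, compress, lr=0.1):
    """Sketch of parallel mini-batch SGD with compressed gradients.

    Compression reduces the size of each message, but every iteration
    still performs one exchange, so the number of communication rounds
    matches plain parallel mini-batch SGD.
    """
    w = np.zeros_like(worker_grads[0][0])
    rounds = 0
    for step_grads in zip(*worker_grads):         # one mini-batch per worker
        msgs = [compress(g) for g in step_grads]  # smaller messages
        w -= lr * np.mean(msgs, axis=0)           # server-style averaging
        rounds += 1                               # exactly one round per step
    return w, rounds

# Two workers, three iterations: three rounds, with or without compression.
grads = [[np.random.randn(4) for _ in range(3)] for _ in range(2)]
w, rounds = compressed_parallel_sgd(grads, quantize_1bit)
```
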
“…In recent years, neural network models have grown dramatically in the number of parameters (Wen et al., 2017), so exchanging gradients during data-parallel training is costly in terms of both bandwidth and time, especially in a distributed setting. Communication can be reduced (possibly at the expense of convergence) by sending only the top 1% of gradients by absolute value, a method known as gradient dropping (Strom, 2015; Dryden et al., 2016; Aji and Heafield, 2017; Lin et al., 2018). Related methods are synchronizing less often (McMahan et al., 2017; Ott et al., 2018) and quantization (Seide et al., 2014; Alistarh et al., 2016).…”
Section: Introduction (mentioning, confidence: 99%)
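
Since the passage describes gradient dropping only in words, the sketch below shows the basic top-1% selection with local accumulation of the dropped values. The exact-threshold selection and the helper name drop_gradients are illustrative assumptions; Dryden et al. (2016) estimate the threshold from a sample rather than computing it exactly.

```python
import numpy as np

def drop_gradients(grad, residual, keep_ratio=0.01):
    """Minimal sketch of gradient dropping (top-k sparsification).

    Only the largest keep_ratio fraction of values by absolute value is
    communicated; the rest is accumulated locally in `residual` and added
    back at the next step.
    """
    compensated = grad + residual
    k = max(1, int(keep_ratio * compensated.size))
    # Magnitude of the k-th largest entry is the sending threshold.
    threshold = np.partition(np.abs(compensated).ravel(), -k)[-k]
    keep = np.abs(compensated) >= threshold
    sparse_update = np.where(keep, compensated, 0.0)  # what gets sent
    new_residual = np.where(keep, 0.0, compensated)   # kept for next step
    return sparse_update, new_residual

# Example: keep the top 1% of a 1,000-element gradient (10 values sent).
grad = np.random.randn(1000)
residual = np.zeros_like(grad)
sent, residual = drop_gradients(grad, residual, keep_ratio=0.01)
```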