2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC)
DOI: 10.1109/mlhpc.2016.004

Communication Quantization for Data-Parallel Training of Deep Neural Networks

Cited by 145 publications (128 citation statements)
References 8 publications

“…The x-axis in both plots is batches, showing that we are not relying on speed improvement to compensate for convergence. Dryden et al. (2016) used a fixed dropping ratio of 98.4% without testing other options. Switching to 99% corresponds to more than a 1.5x reduction in network bandwidth.…”
Section: Drop Ratio (mentioning, confidence: 99%)
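
For context, the bandwidth figure in that statement is plain arithmetic: the fraction of gradient values actually transmitted is one minus the drop ratio, so moving from a 98.4% to a 99% drop ratio shrinks the transmitted fraction from 1.6% to 1.0% (ignoring the index overhead of the sparse encoding).

```latex
% Ratio of transmitted volume at the two drop ratios
\frac{1 - 0.984}{1 - 0.99} = \frac{0.016}{0.010} = 1.6 > 1.5
```
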
“…We focus on data parallelism: nodes jointly optimize the same model on different parts of the training data, exchanging gradients and parameters over the network. This network communication is costly, so prior work developed two ways to approximately compress network traffic: 1-bit quantization (Seide et al., 2014) and sending sparse matrices by dropping small updates (Strom, 2015; Dryden et al., 2016). These methods were developed and tested on speech recognition and toy MNIST systems.…”
Section: Introduction (mentioning, confidence: 99%)
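
As a rough illustration of the first of those two approaches, the sketch below quantizes a gradient to one bit per value with error feedback. It is a simplification under stated assumptions, not the implementation of Seide et al. (2014), which uses per-column scaling factors; a single mean magnitude is assumed here.

```python
import numpy as np

def one_bit_quantize(grad, residual):
    """Minimal sketch of 1-bit gradient quantization with error feedback.

    Each value is reduced to its sign times a shared scale, and the
    quantization error is carried into the next step via `residual`.
    Seide et al. (2014) use per-column scales; a single mean magnitude
    is assumed here for brevity.
    """
    compensated = grad + residual             # add back last step's error
    scale = np.mean(np.abs(compensated))      # one shared magnitude
    quantized = np.sign(compensated) * scale  # effectively 1 bit per value
    new_residual = compensated - quantized    # error kept locally
    return quantized, new_residual

# Each worker keeps its own residual across iterations.
residual = np.zeros(6)
grad = np.array([0.30, -0.10, 0.05, -0.40, 0.20, -0.25])
msg, residual = one_bit_quantize(grad, residual)
```
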
“…Another notable method to reduce communication overhead in parallel mini-batch SGD is to let each worker use compressed gradients rather than raw gradients for communication. For example, quantized SGD, studied in (Seide et al. 2014; Alistarh et al. 2017; Wen et al. 2017), or sparsified SGD, studied in (Strom 2015; Dryden et al. 2016; Aji and Heafield 2017), allows each worker to pass low-bit quantized or sparsified gradients to the server at every iteration while sacrificing convergence to a mild extent. Similarly to D-PSGD, such gradient-compression-based methods require message passing at every iteration, and hence their total number of communication rounds is still the same as in parallel mini-batch SGD.…”
Section: Introduction (mentioning, confidence: 99%)
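
To make the round-counting point concrete, here is a minimal sketch, assuming a toy compressor and a simple averaging step rather than any particular library's API: compression shrinks each message, but every mini-batch still costs one communication round.

```python
import numpy as np

def quantize_1bit(g):
    """Stand-in compressor: sign times mean magnitude (1 bit per value)."""
    return np.sign(g) * np.mean(np.abs(g))

def compressed_parallel_sgd(worker_grads, compress, lr=0.1):
    """Sketch of parallel mini-batch SGD with compressed gradients.

    Compression reduces the size of each message, but every iteration
    still performs one exchange, so the number of communication rounds
    matches plain parallel mini-batch SGD.
    """
    w = np.zeros_like(worker_grads[0][0])
    rounds = 0
    for step_grads in zip(*worker_grads):         # one mini-batch per worker
        msgs = [compress(g) for g in step_grads]  # smaller messages
        w -= lr * np.mean(msgs, axis=0)           # server-style averaging
        rounds += 1                               # exactly one round per step
    return w, rounds

# Two workers, three iterations: three rounds, with or without compression.
grads = [[np.random.randn(4) for _ in range(3)] for _ in range(2)]
w, rounds = compressed_parallel_sgd(grads, quantize_1bit)
```
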
“…In recent years, neural network models have grown dramatically in the number of parameters (Wen et al., 2017), so exchanging gradients during data-parallel training is costly in terms of both bandwidth and time, especially in a distributed setting. Communication can be reduced (possibly at the expense of convergence) by sending only the top 1% of gradients by absolute value, a method known as gradient dropping (Strom, 2015; Dryden et al., 2016; Aji and Heafield, 2017; Lin et al., 2018). Related methods are synchronizing less often (McMahan et al., 2017; Ott et al., 2018) and quantization (Seide et al., 2014; Alistarh et al., 2016).…”
Section: Introduction (mentioning, confidence: 99%)
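
Since the passage describes gradient dropping only in words, the sketch below shows the basic top-1% selection with local accumulation of the dropped values. The exact-threshold selection and the helper name drop_gradients are illustrative assumptions; Dryden et al. (2016) estimate the threshold from a sample rather than computing it exactly.

```python
import numpy as np

def drop_gradients(grad, residual, keep_ratio=0.01):
    """Minimal sketch of gradient dropping (top-k sparsification).

    Only the largest keep_ratio fraction of values by absolute value is
    communicated; the rest is accumulated locally in `residual` and added
    back at the next step.
    """
    compensated = grad + residual
    k = max(1, int(keep_ratio * compensated.size))
    # Magnitude of the k-th largest entry is the sending threshold.
    threshold = np.partition(np.abs(compensated).ravel(), -k)[-k]
    keep = np.abs(compensated) >= threshold
    sparse_update = np.where(keep, compensated, 0.0)  # what gets sent
    new_residual = np.where(keep, 0.0, compensated)   # kept for next step
    return sparse_update, new_residual

# Example: keep the top 1% of a 1,000-element gradient (10 values sent).
grad = np.random.randn(1000)
residual = np.zeros_like(grad)
sent, residual = drop_gradients(grad, residual, keep_ratio=0.01)
```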