2021
DOI: 10.48550/arxiv.2103.00543
Preprint

On the Utility of Gradient Compression in Distributed Training Systems

Saurabh Agarwal,
Hongyi Wang,
Shivaram Venkataraman
et al.

Abstract: Rapid growth in data sets and the scale of neural network architectures have rendered distributed training a necessity. A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training. To alleviate these bottlenecks, the machine learning community has largely focused on developing gradient and model compression methods. In parallel, the systems community has adopted several High Performance Computing (HPC) techniques to speed up distributed training. I…

Cited by 2 publications (3 citation statements)
References: 38 publications
“…For example, many distributed systems use all-reduce style techniques to aggregate gradients or local model updates across compute nodes [198]. In such systems, compression techniques that are not compatible with all-reduce may provide less communication efficiency, despite their higher compression ratio [5,253,258]. In such settings, it is important to make sure that the compression operation commutes with addition.…”
Section: Employ Compression Techniques
Mentioning confidence: 99%
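The commutativity requirement in the quote above can be made concrete with a small sketch. The NumPy example below is purely illustrative and not code from the cited paper; top-k sparsification is used only as a hypothetical compressor that does not commute with addition, while a shared random mask is a simple linear compressor that does and therefore composes with all-reduce.

```python
# Illustrative sketch (not from the paper): why all-reduce-compatible compression
# must commute with addition.
import numpy as np

rng = np.random.default_rng(0)

def topk(g, k):
    """Keep the k largest-magnitude entries, zero the rest (hypothetical compressor)."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

# Two workers' gradients
g1, g2 = rng.normal(size=8), rng.normal(size=8)
k = 2

# All-reduce sums compressed gradients; for a non-commuting compressor,
# sum(compress(g_i)) generally differs from compress(sum(g_i)).
lhs = topk(g1, k) + topk(g2, k)
rhs = topk(g1 + g2, k)
print("top-k commutes with addition:", np.allclose(lhs, rhs))  # usually False

# A shared-mask sparsifier is a linear map, so it commutes and composes with all-reduce.
mask = rng.random(8) < 0.5
def compress(g):
    return g * mask
print("masked sparsification commutes:",
      np.allclose(compress(g1) + compress(g2), compress(g1 + g2)))  # True
```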
“…In particular, if we force all clients to use a τ-fold larger mini-batch size in synchronous SGD, then this algorithm can also save τ times the communication, which is the same as local-update algorithms. We refer to this algorithm as large-batch synchronous SGD. It has been well understood (see [66]) that the worst-case error of large-batch synchronous SGD is:…”
Section: Savings in Communication Rounds
Mentioning confidence: 99%
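To make the τ-fold communication saving mentioned in this quote concrete, here is a hypothetical back-of-the-envelope sketch; the sample budget, baseline batch size, and τ below are assumed values, not figures from the cited work.

```python
# Hypothetical arithmetic (assumed numbers): processing a fixed budget of N gradient
# samples per worker with a tau-fold larger mini-batch cuts the number of synchronous
# communication rounds by a factor of tau.
N = 1_000_000          # per-worker sample budget (assumed)
base_batch = 64        # baseline mini-batch size (assumed)
tau = 8                # batch-size multiplier (assumed)

rounds_baseline = N // base_batch             # one all-reduce per mini-batch
rounds_large_batch = N // (tau * base_batch)  # tau-fold fewer all-reduces

print(rounds_baseline, rounds_large_batch)    # 15625 1953
```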
“…An efficient solution should significantly reduce the total number of bits used in a given transmission without impacting rapid convergence to a good estimate of θ. In order to provide an end-to-end speedup, a list of properties that compression methods should satisfy is proposed in [27]. Quantization is a popular data compression approach which aims to approximate some quantity using a smaller number of bits to simplify processing, storage and analysis.…”
Section: A. Related Work
Mentioning confidence: 99%
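As a concrete illustration of the quantization idea in the last quote, the sketch below uniformly quantizes a float32 gradient vector to 8-bit codes; the function names and parameter choices are assumptions for illustration, not the scheme proposed in [27] or evaluated in the paper.

```python
# Generic illustration (an assumption, not the cited scheme): uniform 8-bit
# quantization of a float32 gradient, cutting transmitted bits 4x at the cost
# of a bounded approximation error.
import numpy as np

rng = np.random.default_rng(1)
grad = rng.normal(size=1024).astype(np.float32)

def quantize_uint8(g):
    """Map g linearly onto 256 levels; return codes plus the offset and scale."""
    lo, hi = float(g.min()), float(g.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    codes = np.round((g - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

codes, lo, scale = quantize_uint8(grad)
approx = dequantize(codes, lo, scale)

print("bits per entry: 32 -> 8 (plus a small per-tensor header)")
print("max abs error:", float(np.max(np.abs(grad - approx))))  # roughly <= scale / 2
```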