2019
DOI: 10.48550/arxiv.1911.08772
Preprint

Understanding Top-k Sparsification in Distributed Deep Learning

Abstract: Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck. Recently proposed gradient sparsification techniques, especially Top-k sparsification with error compensation (TopK-SGD), can significantly reduce the communication traffic without obvious impact on the model accuracy. Some theoretical studies have been carried out to analyze the convergence property of TopK-…

Cited by 26 publications (47 citation statements)
References 20 publications (28 reference statements)
“…Gradient sparsification [2,4,11,16,19,36,[41][42][43]50] is a key approach to lower the communication volume. By top-π‘˜ selection, i.e., only selecting the largest (in terms of the absolute value) π‘˜ of 𝑛 components, the gradient becomes very sparse (commonly around 99%).…”
Section: Background and Related Workmentioning
confidence: 99%
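The top-k selection described in this citation statement can be sketched in a few lines. The following is a minimal illustration, assuming PyTorch; the function name `topk_sparsify` and the 1% compression ratio are illustrative, not taken from the cited papers.

```python
# Minimal sketch of top-k gradient sparsification (assumes PyTorch;
# the function name and the 1% ratio are illustrative, not from the paper).
import torch

def topk_sparsify(grad: torch.Tensor, k: int):
    """Keep only the k largest-magnitude components of a gradient tensor.

    Only (values, indices) need to be communicated; the remaining
    components are treated as zero, giving roughly (1 - k/n) sparsity.
    """
    flat = grad.flatten()
    # Indices of the k components with the largest absolute value.
    _, idx = torch.topk(flat.abs(), k)
    return flat[idx], idx

# Example: keep 1% of a 10,000-element gradient (~99% sparsity).
g = torch.randn(100, 100)
vals, idx = topk_sparsify(g, k=100)
```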
“…Then, the accumulated sparse gradient is used in the Stochastic Gradient Descent (SGD) optimizer to update the model parameters, which is called Topπ‘˜ SGD. The convergence of Topπ‘˜ SGD has been theoretically and empirically proved [4,36,41]. However, the parallel scalablity of the existing sparse allreduce algorithms is limited, which makes it very difficult to obtain real performance improvement, especially on the machines (e.g., supercomputers) with high-performance interconnected networks [5,17,37,40].…”
Section: Background and Related Workmentioning
confidence: 99%
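The error-compensation step mentioned in this statement (accumulating the dropped components locally and adding them back before the next top-k selection) can be sketched as follows. This is a minimal single-worker sketch assuming PyTorch; the class and variable names are illustrative, and the exchange of the sparse (values, indices) pair between workers via sparse allreduce is omitted.

```python
# Minimal sketch of error-compensated Top-k SGD on one worker (assumes
# PyTorch; names are illustrative, and the inter-worker exchange of the
# sparse (values, indices) pair is omitted).
import torch

class TopKCompressor:
    def __init__(self, k: int):
        self.k = k
        self.residual = None  # locally accumulated error (dropped components)

    def compress(self, grad: torch.Tensor):
        flat = grad.flatten()
        if self.residual is None:
            self.residual = torch.zeros_like(flat)
        acc = flat + self.residual           # error compensation
        _, idx = torch.topk(acc.abs(), self.k)
        vals = acc[idx]
        self.residual = acc.clone()          # remember what was not selected
        self.residual[idx] = 0
        return vals, idx

    def decompress(self, vals, idx, shape):
        dense = torch.zeros(shape, device=vals.device).flatten()
        dense[idx] = vals
        return dense.view(shape)

# One SGD iteration using the sparsified, error-compensated gradient.
comp = TopKCompressor(k=100)
param = torch.randn(100, 100, requires_grad=True)
loss = (param ** 2).sum()
loss.backward()
vals, idx = comp.compress(param.grad)
sparse_grad = comp.decompress(vals, idx, param.grad.shape)
with torch.no_grad():
    param -= 0.01 * sparse_grad
```

In a distributed setting, the (values, indices) pairs from all workers would be combined with a sparse allreduce before the parameter update; the limited scalability of that step is the bottleneck the citation statement above refers to.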