Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2020
DOI: 10.1145/3332466.3374528
Taming unbalanced training workloads in deep learning with partial collective operations

Abstract: Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use t…
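As a rough illustration of the relaxed-synchronization idea described in the abstract (not the paper's actual partial collective primitives or API), the toy sketch below averages fresh gradients only from the fastest workers and substitutes a cached, stale contribution for stragglers; the function name, quorum rule, and simulated latencies are all hypothetical.

```python
# Illustrative sketch only: simulates quorum-based partial gradient
# accumulation, where stragglers' fresh gradients are skipped and a stale
# cached contribution is used instead. Not eager-SGD's actual primitives.
import numpy as np

def partial_accumulate(fresh_grads, arrival_times, stale_grads, quorum):
    """Average gradients of the `quorum` fastest workers; for the rest,
    fall back to their previously cached (stale) gradients."""
    order = np.argsort(arrival_times)              # fastest workers first
    fast = set(order[:quorum].tolist())
    contrib = [fresh_grads[w] if w in fast else stale_grads[w]
               for w in range(len(fresh_grads))]
    return np.mean(contrib, axis=0)

rng = np.random.default_rng(0)
P, n = 4, 8                                        # workers, gradient size
fresh = [rng.normal(size=n) for _ in range(P)]
stale = [np.zeros(n) for _ in range(P)]            # e.g., zeros on the first step
times = rng.exponential(scale=1.0, size=P)         # simulated per-worker latencies
avg_grad = partial_accumulate(fresh, times, stale, quorum=P // 2 + 1)
print(avg_grad.shape)  # (8,)
```

A real system would perform this accumulation inside the communication layer rather than on gathered arrays; the point here is only that slow workers need not block the gradient update.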

Cited by 47 publications (28 citation statements)
References 45 publications
“…Those will experience variability in both execution time and memory consumption. It has also been recently observed in machine learning framework on GPUs [32].…”
Section: Related Work
confidence: 79%
“…To scale up the training process to parallel machines, data parallelism [18,25,26,38,52,53] is the common method, in which the mini-batch is partitioned among 𝑃 workers and each worker maintains a copy of the entire model. Gradient accumulation across 𝑃 workers is often implemented using a standard dense allreduce [12], leading to about 2𝑛 communication volume where 𝑛 is the number of gradient components (equal to the number of model parameters).…”
Section: Background and Related Work
confidence: 99%
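The ≈2𝑛 per-worker volume quoted above matches the standard ring-allreduce cost model (reduce-scatter followed by allgather), which sends 2(𝑃−1)/𝑃 · 𝑛 elements per worker and approaches 2𝑛 as 𝑃 grows. A small back-of-envelope check, with an illustrative parameter count:

```python
# Back-of-envelope check of the ~2n per-worker volume of a ring allreduce
# (reduce-scatter + allgather). The parameter count is illustrative.
def ring_allreduce_volume(n, P):
    """Elements sent per worker: 2 * (P - 1) / P * n, which tends to 2n."""
    return 2 * (P - 1) / P * n

n = 25_000_000                      # e.g., a model with ~25M parameters
for P in (4, 16, 64):
    print(P, ring_allreduce_volume(n, P) / n)   # -> 1.5, 1.875, ~1.97
```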
“…We compare the performance of RNA with three other synchronization models: Horovod [49], AD-PSGD [37], and eager-SGD [35]. Horovod is selected as the state-of-the-art baseline, which significantly outperforms many other implementations of All-Reduce.…”
Section: Approaches and Performance Metrics
confidence: 99%
“…Prague [39] and Eager-SGD [35] are more related to our approach, which proposes a new communication primitive to allow partial workers to synchronize parameters quickly. Specifically, Prague offers both static and dynamic group scheduling to construct a new group randomly during the runtime to avoid conflicts.…”
Section: Related Work
confidence: 99%
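As a loose illustration of the partial-synchronization idea mentioned in this last statement, the sketch below cuts workers into random disjoint groups so that each group could run its own small allreduce without waiting for the others; the grouping rule is a toy assumption, not Prague's or eager-SGD's actual scheduling algorithm.

```python
# Toy sketch of randomized, disjoint synchronization groups. Not Prague's
# (or eager-SGD's) actual group-scheduling algorithm.
import random

def random_groups(workers, group_size, seed=None):
    """Shuffle workers and cut them into disjoint groups; each group can
    then synchronize internally without blocking on the others."""
    rng = random.Random(seed)
    order = list(workers)
    rng.shuffle(order)
    return [order[i:i + group_size] for i in range(0, len(order), group_size)]

print(random_groups(range(8), group_size=4, seed=0))
# e.g. [[...4 workers...], [...4 workers...]] -- two disjoint groups
```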