NASPipe: high performance and reproducible pipeline parallel supernet training via causal synchronous parallelism

Zhao, Shixiong; Li, Fanxin; Chen, Xusheng; Shen, Tianxiang; Chen, Li; Wang, Sen; Zhang, Nicholas; Li, Cheng; Cui, Heming

doi:10.1145/3503222.3507735

Cited by 8 publications

(1 citation statement)

References 29 publications


“…Especially for methods that explore large solution spaces such as the neural architecture search (NAS) [3,4], the problem becomes even more significant. This problem mandates the use of model parallelism [5,6], which creates substantial throughput loss with inevitable pipeline bubbles.…”

Section: Introduction (mentioning)

confidence: 99%

Pipe-BD: Pipelined Parallel Blockwise Distillation

Hongsun, Jung, Song et al. 2023

Preprint


Training large deep neural network models is highly challenging due to their tremendous computational and memory requirements. Blockwise distillation provides one promising method towards faster convergence by splitting a large model into multiple smaller models. In state-of-the-art blockwise distillation methods, training is performed block-by-block in a data-parallel manner using multiple GPUs. To produce inputs for the student blocks, the teacher model is executed from the beginning until the current block under training. However, this results in a high overhead of redundant teacher execution, low GPU utilization, and extra data loading. To address these problems, we propose Pipe-BD, a novel parallelization method for blockwise distillation. Pipe-BD aggressively utilizes pipeline parallelism for blockwise distillation, eliminating redundant teacher block execution and increasing per-device batch size for better resource utilization. We also extend to hybrid parallelism for efficient workload balancing. As a result, Pipe-BD achieves significant acceleration without modifying the mathematical formulation of blockwise distillation. We implement Pipe-BD on PyTorch, and experiments reveal that Pipe-BD is effective on multiple scenarios, models, and datasets.

