2018
DOI: 10.48550/arxiv.1806.03377
Preprint

PipeDream: Fast and Efficient Pipeline Parallel DNN Training

Abstract: PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes computation by pipelining execution across multiple machines. Its pipeline-parallel computing model avoids the slowdowns faced by data-parallel training when large models and/or limited network bandwidth induce high communication-to-computation ratios. PipeDream reduces communication by up to 95% for large DNNs relative to data-parallel training, and allows perfect overlap of communication and computation. PipeDream keeps all av…

Cited by 46 publications (75 citation statements)
References 23 publications
“…On the other hand, research on model parallelism studies how to allocate model parameters and training computation across compute units in a cluster to maximize training throughput and minimize communication overheads. Optimizations have been proposed for both operation partitioning approaches [37,74,75,92] and pipeline parallel approaches [32,34,84]. Recently, approaches that combine both data and model parallelism have also been proposed [51,67,70].…”
Section: Distributed Deep Learning
confidence: 99%
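The layer-partitioning idea described in this statement can be illustrated with a minimal sketch. The code below is not the partitioning algorithm of any cited system; it is a hand-written two-stage split of a toy PyTorch model across two GPUs (the device names, layer sizes, and split point are illustrative assumptions), showing the inter-stage activation transfer that such partitioning schemes try to minimize.

```python
# Minimal sketch of layer-wise model parallelism, assuming two CUDA devices
# ("cuda:0", "cuda:1") are available; sizes and the two-stage split are
# illustrative choices, not taken from the cited papers.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First part of the layers lives on device 0, the rest on device 1.
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        # Activations are copied between devices at the partition boundary;
        # this inter-stage transfer is the communication that partitioning
        # algorithms aim to keep small.
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(32, 1024))   # forward crosses both devices
out.sum().backward()                 # backward flows stage1 -> stage0
```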
“…Pipeline Parallelism: To accelerate the distributed training process, PipeDream (Harlap et al., 2018; Narayanan et al., 2019, 2021) and GPipe (Huang et al., 2019) propose pipelined model parallelism so that multiple input data can be pushed through all the available workers in sequential order. To be specific, PipeDream pipelines the execution of forward passes and intersperses them with backward passes in an attempt to minimize the processor idle time.…”
Section: Related Work
confidence: 99%
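The "one forward, one backward" interleaving described in this statement can be sketched as a schedule alone, with no real training. The function below is an assumed reconstruction of a PipeDream-style 1F1B ordering, not the authors' implementation; the stage and micro-batch counts are illustrative.

```python
# Schedule-only sketch of 1F1B ("one forward, one backward") pipelining.
# It prints, per pipeline stage, the order in which micro-batch forward (F)
# and backward (B) work items run. No tensors are involved.
def one_f_one_b_schedule(num_stages: int, num_microbatches: int, stage: int):
    """Return the work-item order for one pipeline stage."""
    # Warm-up: deeper stages run fewer initial forwards before the first
    # backward arrives, which keeps every stage busy in steady state.
    warmup = min(num_stages - stage, num_microbatches)
    order, fwd, bwd = [], 0, 0
    for _ in range(warmup):
        order.append(f"F{fwd}"); fwd += 1
    # Steady state: alternate one backward with one forward.
    while fwd < num_microbatches:
        order.append(f"B{bwd}"); bwd += 1
        order.append(f"F{fwd}"); fwd += 1
    # Cool-down: drain the remaining backwards.
    while bwd < num_microbatches:
        order.append(f"B{bwd}"); bwd += 1
    return order

for s in range(4):  # 4 stages, 8 micro-batches (illustrative)
    print(f"stage {s}:", " ".join(one_f_one_b_schedule(4, 8, s)))
```

Printing the schedule shows the warm-up forwards, the steady-state alternation of forwards and backwards, and the cool-down backwards that together reduce processor idle time.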
“…To improve the training efficiency, various parallelization techniques such as data-parallelism (Iandola et al., 2016), model-parallelism (Dean et al., 2012), and a combination of both (Paine et al., 2013; Harlap et al., 2018) have been proposed to reduce the training runtime. Unfortunately, none of these methods could fully overcome the scalability barrier created by the intrinsically serial propagation of data within the network itself (Günther et al., 2020), thereby forcing the distributed machines to work synchronously and preventing us from fully leveraging the computing resources.…”
Section: Introduction
confidence: 99%
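For concreteness, the synchronous data-parallel baseline contrasted in this statement can be sketched in a single process: each replica computes gradients on its own shard of the batch, the gradients are averaged (the all-reduce step whose cost PipeDream targets), and every replica applies the same update. This is a toy simulation, not any cited system's implementation; the model size, replica count, and learning rate are illustrative assumptions.

```python
# Single-process sketch of synchronous data parallelism with a manual
# "all-reduce" (gradient averaging) across replica copies of one model.
import torch
import torch.nn as nn

replicas = [nn.Linear(512, 512) for _ in range(4)]
for r in replicas[1:]:                      # start from identical weights
    r.load_state_dict(replicas[0].state_dict())

batch = torch.randn(32, 512)
shards = batch.chunk(len(replicas))         # each replica gets one shard

for r, x in zip(replicas, shards):
    r(x).pow(2).mean().backward()           # local forward/backward

# "All-reduce": average each parameter's gradient across replicas.
for grads in zip(*(r.parameters() for r in replicas)):
    mean_grad = torch.stack([p.grad for p in grads]).mean(dim=0)
    for p in grads:
        p.grad = mean_grad.clone()

for r in replicas:                          # identical synchronous update
    with torch.no_grad():
        for p in r.parameters():
            p -= 0.01 * p.grad
```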
“…To maximise performance on the IPU, it becomes important to keep as much of the working memory (for example, activation state) on-chip. This naturally promotes the use of much smaller batches, memory-saving optimisations (Chen et al., 2016; Gruslys et al., 2016), and innovative forms of distributed processing (Harlap et al., 2018; Huang et al., 2019; Ben-Nun & Hoefler, 2018; Shazeer et al., 2018). At the same time, it does require reconsidering the use of Batch Normalization (Ioffe & Szegedy, 2015), the most common normalization method in vision models, which relies on large batches.…”
Section: Hardware Considerations
confidence: 99%
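The memory-saving optimisation cited above (Chen et al., 2016) is activation recomputation, which can be sketched with PyTorch's torch.utils.checkpoint. This is only one possible realisation of the technique; the layer sizes and batch size below are illustrative assumptions, and a reasonably recent PyTorch is assumed for the use_reentrant flag.

```python
# Minimal sketch of activation recomputation ("checkpointing"): activations
# inside the wrapped block are not stored during the forward pass and are
# recomputed during the backward pass, trading extra compute for less
# working memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
head = nn.Linear(1024, 10)

x = torch.randn(8, 1024, requires_grad=True)   # small batch, as on-chip memory favours
h = checkpoint(block, x, use_reentrant=False)  # block's inner activations are recomputed
loss = head(h).sum()
loss.backward()                                # gradients match the non-checkpointed version
```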