2020
DOI: 10.48550/arxiv.2005.14038
Preprint

HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism

Abstract: Deep Neural Network (DNN) models have continuously been growing in size in order to improve the accuracy and quality of the models. Moreover, for training of large DNN models, the use of heterogeneous GPUs is inevitable due to the short release cycle of new GPU architectures. In this paper, we investigate how to enable training of large DNN models on a heterogeneous GPU cluster that possibly includes whimpy GPUs that, as a standalone, could not be used for training. We present a DNN training system, HetPipe (H…
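The abstract is truncated, but the idea named in the title (pipelined model parallelism inside a group of weaker GPUs, data parallelism across such groups) can be pictured with a minimal structural sketch. Everything below is an illustrative assumption, not HetPipe's actual code: the `VirtualWorker` and `ParameterServer` names, the GPU labels, and the fake gradient are all made up for the example.

```python
# Minimal structural sketch (assumed names, not the authors' code): several
# "whimpy" GPUs form one virtual worker that runs pipelined model parallelism
# (PMP) internally, while multiple virtual workers train data-parallel (DP)
# copies of the model through a parameter server.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ParameterServer:
    """Holds the global weights shared by all virtual workers (DP side)."""
    weights: Dict[str, float] = field(default_factory=lambda: {"w": 0.0})

    def push(self, grads: Dict[str, float], lr: float = 0.1) -> None:
        # Simple SGD update with the gradients sent by one virtual worker.
        for k, g in grads.items():
            self.weights[k] = self.weights.get(k, 0.0) - lr * g

    def pull(self) -> Dict[str, float]:
        return dict(self.weights)


@dataclass
class VirtualWorker:
    """One virtual worker = several (possibly weak) GPUs forming a pipeline (PMP side)."""
    gpus: List[str]  # e.g. ["K80", "K80"]; labels are illustrative only

    def train_minibatch(self, weights: Dict[str, float]) -> Dict[str, float]:
        # The model would be partitioned into len(self.gpus) pipeline stages and
        # micro-batches streamed through them; here we only fake a gradient.
        return {"w": 1.0 / len(self.gpus)}


ps = ParameterServer()
workers = [VirtualWorker(["K80", "K80"]), VirtualWorker(["P100", "V100"])]

for step in range(3):
    for vw in workers:                        # in HetPipe these run concurrently
        grads = vw.train_minibatch(ps.pull())  # PMP inside the virtual worker
        ps.push(grads)                         # DP across virtual workers
print(ps.weights)
```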

Cited by 1 publication (4 citation statements)
References 24 publications
“…Parameter staleness-free
Mesh-TensorFlow [14], Megatron-LM [10]   Tensor   Yes   Manual   No    Yes
OptCNN [15], FlexFlow [16], Tofu [17]    Tensor   Yes   Auto     No    Yes
GPipe [11]                               Graph    No    Manual   No    Yes
AMPNet [18], XPipe [19]                  Graph    No    Manual   No    No
PipeDream [8], SpecTrain [20]            Graph    Yes   Auto     No    No
PipeDream-2BW [21], HetPipe [22]         Graph    Yes   Auto     Yes   No
RaNNC (Ours)                             Graph    Yes   Auto     Yes   Yes
In graph partitioning, such tasks are regarded as atomic and cannot be further partitioned. Unfortunately, when the partitioned subcomponents to be computed on different accelerator devices have sequential dependencies, only one accelerator device can be used at a time.…”
Section: Memory Estimation (mentioning)
confidence: 99%
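The point in this statement that sequential dependencies leave only one accelerator busy at a time can be made concrete with a toy schedule. The sketch below is my own illustration under an idealized one-time-unit-per-stage assumption; it is not code from RaNNC, HetPipe, or GPipe.

```python
# Toy timeline: with a model split into sequential stages on different devices
# and no micro-batching, only one device works at any time, so utilization is
# 1 / num_stages. The scheduling model below is an assumption for illustration.

num_stages = 4        # model partitions, one per accelerator
num_microbatches = 1  # no pipelining: the whole mini-batch moves stage by stage

# Each (stage, micro-batch) pair takes one time unit; stage s can start
# micro-batch m only after stage s-1 has finished it (forward-pass dependency).
finish = {}
for m in range(num_microbatches):
    for s in range(num_stages):
        ready = max(finish.get((s - 1, m), 0), finish.get((s, m - 1), 0))
        finish[(s, m)] = ready + 1

makespan = max(finish.values())
busy = num_stages * num_microbatches
print(f"utilization = {busy}/{num_stages * makespan} = "
      f"{busy / (num_stages * makespan):.2f}")
# With num_microbatches = 1 this prints 0.25; raising it to 8 (GPipe-style
# micro-batching) pushes utilization toward 1.
```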
“…As mentioned in the previous section, some previous works [18], [19], [8], [20], [21], [22] employed asynchronous pipeline parallelism, which suffers from parameter staleness issues [9]. Such issues are caused by computing a mini-batch using different versions of parameters across stages.…”
Section: Memory Estimation (mentioning)
confidence: 99%
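The staleness issue quoted above can also be shown with a small sketch. The schedule below is an idealized 1F1B-style steady state of my own construction (the formula and the `weight_version_at_forward` helper are assumptions, not taken from PipeDream or HetPipe); it only demonstrates that one mini-batch's forward pass meets a different weight version on each stage.

```python
# Toy illustration of parameter staleness in asynchronous pipeline parallelism:
# each stage applies weight updates as soon as a backward pass finishes, so the
# forward pass of a single mini-batch sees a different weight version per stage.

num_stages = 4

def weight_version_at_forward(stage: int, minibatch: int) -> int:
    """Number of updates a stage has already applied when it starts the forward
    pass of `minibatch`, in an idealized 1F1B steady state (0-indexed)."""
    return max(0, minibatch - (num_stages - 1 - stage))

minibatch = 6
versions = [weight_version_at_forward(s, minibatch) for s in range(num_stages)]
print(f"mini-batch {minibatch} runs its forward pass on weight versions {versions}")
# Prints [3, 4, 5, 6]: each stage uses a different (stale) parameter version,
# which is the inconsistency the citing paper points out; PipeDream's weight
# stashing and HetPipe's bounded-staleness synchronization are two ways to tame it.
```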