This paper proposes DisCo, an automatic deep learning compilation module for data-parallel distributed training. Unlike most deep learning compilers, which focus on training or inference on a single device, DisCo optimizes a DNN model for distributed training over multiple GPU machines. Existing single-device compilation strategies do not work well in distributed training, mainly due to the communication inefficiency they incur. DisCo generates optimized, joint computation operator and communication tensor fusion strategies to enable highly efficient distributed training. A GNN-based simulator is built to effectively estimate the per-iteration training time achieved by operator/tensor fusion candidates. A backtracking search algorithm is driven by the simulator, navigating the large strategy space efficiently to identify good operator/tensor fusion strategies that minimize distributed training time. We compare DisCo with existing DL fusion schemes and show that it achieves training speed-ups close to the ideal case of full computation-communication overlap.

Keywords: Distributed Systems • Machine Learning
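As a rough illustration of the kind of cost model the abstract refers to, the following is a minimal sketch, not DisCo's actual simulator, of how a GNN could map a fused op/tensor graph to a predicted per-iteration training time. The node features, the two rounds of message passing, and all names (e.g., `IterTimeGNN`) are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class IterTimeGNN(nn.Module):
    """Toy graph network mapping a fused training graph to a predicted iteration time."""
    def __init__(self, feat_dim: int = 8, hidden: int = 32):
        super().__init__()
        self.embed = nn.Linear(feat_dim, hidden)
        self.msg = nn.Linear(hidden, hidden)          # transform neighbour states into messages
        self.update = nn.Linear(2 * hidden, hidden)   # combine own state with aggregated messages
        self.readout = nn.Linear(hidden, 1)           # graph-level prediction: iteration time (ms)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: [N, feat_dim] per-node features (e.g., FLOPs, tensor bytes,
        # fusion-group id, compute-vs-communication flag); adj: [N, N] adjacency matrix.
        h = torch.relu(self.embed(node_feats))
        for _ in range(2):                            # two rounds of message passing
            agg = adj @ self.msg(h)                   # sum messages along graph edges
            h = torch.relu(self.update(torch.cat([h, agg], dim=-1)))
        return self.readout(h.sum(dim=0))             # pool over nodes, then predict

# Usage: score one op/tensor fusion candidate encoded as a 5-node chain graph.
model = IterTimeGNN()
feats = torch.rand(5, 8)
adj = torch.zeros(5, 5)
adj[torch.arange(4), torch.arange(1, 5)] = 1.0        # chain edges 0->1->2->3->4
predicted_ms = model(feats, adj)                      # tensor of shape [1]
```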
There are also projects focusing on model parallelism and pipeline parallelism. Megatron-LM [10] introduces an efficient intra-layer model-parallel approach to support training of very large transformer models. GPipe [11] and PipeDream [12] propose pipeline parallelism to further improve model parallelism by pipelining forward computation and backward propagation across several micro-batches. CoCoNet [13] enables optimization of data-, model-, and pipeline-parallel workloads in large language models by introducing a domain-specific language that easily expresses distributed training of models.

This paper focuses on front-end compilation optimization to expedite synchronous data-parallel training. Op fusion strategies have been studied as one of the most important optimizations for reducing computation overhead [4,14,15]. Tensor fusion has been shown to play an important role in reducing communication overhead [16,17,18]. We inspect the performance trade-offs caused by op fusion and tensor fusion in distributed training, and advocate joint op and tensor fusion optimization. We propose DisCo, an automatic module that jointly optimizes computation and communication fusion over a whole distributed DNN training graph. Existing rule-based op fusion strategies rely heavily on expert experience and are often suboptimal due to their limited exploration of the solution space. DisCo adopts a search-based algorithm to identify optimized joint fusion strategies. We summarize the main contributions of DisCo as follows:

⊲ We propose an automatic compilation module that jointly optimizes op and tensor fusion for distributed training of DNN models, expediting computation and communication separately while maximally overlapping their execution.

⊲ Op fusion and tensor fusion, two conventionally separate optimization passes, are unified into a joint strategy space. A backtracking search algorithm is designed to efficiently prune this large joint strategy space (a simulator-driven sketch of such a search is given after this list).
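The sketch below illustrates, under simplifying assumptions, the kind of simulator-driven backtracking search described above: each op is either fused into the current group or starts a new one, and a partial plan is abandoned once its simulated time already matches or exceeds the best complete plan (assuming the simulated time never decreases as a plan is extended). The function names and the toy cost model are illustrative, not DisCo's implementation.

```python
from typing import Callable, List

def backtrack_fusion(num_ops: int,
                     simulate_iter_time: Callable[[List[int]], float]) -> List[int]:
    """Assign each op a fusion-group id so the simulated per-iteration time is minimal."""
    best_plan: List[int] = []
    best_time = float("inf")

    def recurse(plan: List[int]) -> None:
        nonlocal best_plan, best_time
        if len(plan) == num_ops:                       # complete strategy: score it
            t = simulate_iter_time(plan)
            if t < best_time:
                best_plan, best_time = list(plan), t
            return
        # Backtracking prune: if the partial plan already simulates no better than the
        # best complete plan, abandon this branch (assumes extending never reduces time).
        if plan and simulate_iter_time(plan) >= best_time:
            return
        last_group = plan[-1] if plan else -1
        for group in (last_group, last_group + 1):     # fuse into current group, or open a new one
            if group < 0:
                continue
            plan.append(group)
            recurse(plan)
            plan.pop()

    recurse([])
    return best_plan

# Usage with a toy "simulator" that charges per op and penalizes more than two fusion groups.
def toy_cost(plan: List[int]) -> float:
    return 0.05 * len(plan) + 0.3 * max(0, len(set(plan)) - 2)

print(backtrack_fusion(6, toy_cost))                   # -> [0, 0, 0, 0, 0, 0] for this toy cost
```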