This paper proposes DisCo, an automatic deep learning compilation module for data-parallel distributed training. Unlike most deep learning compilers, which focus on training or inference on a single device, DisCo optimizes a DNN model for distributed training over multiple GPU machines. Existing single-device compilation strategies do not work well in distributed training, mainly due to the communication inefficiency they incur. DisCo generates optimized, joint computation operator and communication tensor fusion strategies to enable highly efficient distributed training. A GNN-based simulator is built to effectively estimate the per-iteration training time achieved by operator/tensor fusion candidates. A backtracking search algorithm is driven by the simulator, navigating the large strategy space efficiently to identify good operator/tensor fusion strategies that minimize distributed training time. We compare DisCo with existing DL fusion schemes and show that it achieves training speed-ups close to the ideal case of full computation-communication overlap.

Keywords: Distributed Systems • Machine Learning
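As a rough illustration of the kind of cost model the abstract refers to, the following is a minimal sketch, not DisCo's actual simulator, of how a GNN could map a fused op/tensor graph to a predicted per-iteration training time. The node features, the two rounds of message passing, and all names (e.g., `IterTimeGNN`) are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class IterTimeGNN(nn.Module):
    """Toy graph network mapping a fused training graph to a predicted iteration time."""
    def __init__(self, feat_dim: int = 8, hidden: int = 32):
        super().__init__()
        self.embed = nn.Linear(feat_dim, hidden)
        self.msg = nn.Linear(hidden, hidden)          # transform neighbour states into messages
        self.update = nn.Linear(2 * hidden, hidden)   # combine own state with aggregated messages
        self.readout = nn.Linear(hidden, 1)           # graph-level prediction: iteration time (ms)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: [N, feat_dim] per-node features (e.g., FLOPs, tensor bytes,
        # fusion-group id, compute-vs-communication flag); adj: [N, N] adjacency matrix.
        h = torch.relu(self.embed(node_feats))
        for _ in range(2):                            # two rounds of message passing
            agg = adj @ self.msg(h)                   # sum messages along graph edges
            h = torch.relu(self.update(torch.cat([h, agg], dim=-1)))
        return self.readout(h.sum(dim=0))             # pool over nodes, then predict

# Usage: score one op/tensor fusion candidate encoded as a 5-node chain graph.
model = IterTimeGNN()
feats = torch.rand(5, 8)
adj = torch.zeros(5, 5)
adj[torch.arange(4), torch.arange(1, 5)] = 1.0        # chain edges 0->1->2->3->4
predicted_ms = model(feats, adj)                      # tensor of shape [1]
```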
There are also projects focusing on model parallelism and pipeline parallelism. Megatron-LM [10] introduces an efficient intra-layer model-parallel approach to support training of very large transformer models. GPipe [11] and PipeDream [12] propose pipeline parallelism to further improve model parallelism by pipelining forward computation and backward propagation across several micro-batches. CoCoNet [13] enables optimization of data-, model-, and pipeline-parallel workloads in large language models by introducing a domain-specific language that easily expresses distributed training of models.

This paper focuses on front-end compilation optimization to expedite synchronous data-parallel training. Op fusion strategies have been studied as one of the most important optimizations for reducing computation overhead [4,14,15]. Tensor fusion has been shown to play an important role in reducing communication overhead [16,17,18]. We inspect the performance trade-offs caused by op fusion and tensor fusion in distributed training, and advocate joint op and tensor fusion optimization. We propose DisCo, an automatic module that jointly optimizes computation and communication fusion over a whole distributed DNN training graph. Existing rule-based op fusion strategies rely heavily on expert experience and are often suboptimal due to their limited exploration of the solution space. DisCo adopts a search-based algorithm to identify optimized joint fusion strategies. We summarize the main contributions of DisCo as follows:

⊲ We propose an automatic compilation module that jointly optimizes op and tensor fusion for distributed training of DNN models, expediting computation and communication separately while maximally overlapping their execution.

⊲ Op fusion and tensor fusion, two conventionally separate optimization passes, are unified into a joint strategy space. A backtracking search algorithm is designed to efficiently prune this large joint strategy space (a simulator-driven sketch of such a search is given after this list).
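The sketch below illustrates, under simplifying assumptions, the kind of simulator-driven backtracking search described above: each op is either fused into the current group or starts a new one, and a partial plan is abandoned once its simulated time already matches or exceeds the best complete plan (assuming the simulated time never decreases as a plan is extended). The function names and the toy cost model are illustrative, not DisCo's implementation.

```python
from typing import Callable, List

def backtrack_fusion(num_ops: int,
                     simulate_iter_time: Callable[[List[int]], float]) -> List[int]:
    """Assign each op a fusion-group id so the simulated per-iteration time is minimal."""
    best_plan: List[int] = []
    best_time = float("inf")

    def recurse(plan: List[int]) -> None:
        nonlocal best_plan, best_time
        if len(plan) == num_ops:                       # complete strategy: score it
            t = simulate_iter_time(plan)
            if t < best_time:
                best_plan, best_time = list(plan), t
            return
        # Backtracking prune: if the partial plan already simulates no better than the
        # best complete plan, abandon this branch (assumes extending never reduces time).
        if plan and simulate_iter_time(plan) >= best_time:
            return
        last_group = plan[-1] if plan else -1
        for group in (last_group, last_group + 1):     # fuse into current group, or open a new one
            if group < 0:
                continue
            plan.append(group)
            recurse(plan)
            plan.pop()

    recurse([])
    return best_plan

# Usage with a toy "simulator" that charges per op and penalizes more than two fusion groups.
def toy_cost(plan: List[int]) -> float:
    return 0.05 * len(plan) + 0.3 * max(0, len(set(plan)) - 2)

print(backtrack_fusion(6, toy_cost))                   # -> [0, 0, 0, 0, 0, 0] for this toy cost
```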