Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models

Wang, Shibo; Wei, Jianhui; Sabne, Amit; Davis, Andy; Ilbeyi, Berkin; Hechtman, Blake A.; Chen, Dehao; Murthy, Karthik; Maggioni, Marcello; Zhang, Qiao; Kumar, Sameer; Guo, Tongfei; Xu, Yuanzhong; Zhou, Zongwei

doi:10.1145/3567955.3567959

Cited by 12 publications

(7 citation statements)

References 13 publications

“…For very large models (e.g., PALM, MT-NLG) we consider 32-way slicing, and for futuristic ones with one and ten trillion parameters, we consider 64-way sharding. The increasing TP slicing is necessary because these models' larger sizes cannot fit in 16 GPUs [64] and the increased slicing is also enabled by nodes with larger device counts [59,86]. Like prior work [36,49,64], we find that communication is a considerable fraction of the overall runtime: Megatron-GPT-2 (Mega-GPT-2) and T-NLG spend up to 34% and 43% of their training and inference (prompt phase) time on communication.…”

Section: All-reduce Is On the Critical Path And Can Be Large (supporting)

confidence: 50%

“…However, T3 can also be applied to serialized communication in inter-node setups with slower and often heterogeneous links. Consequently, communication costs can be much larger than GEMM executions, potentially limiting the benefits from fine-grained overlap: once the computation is completely overlapped, the remaining communication costs will be exposed [86]. Nevertheless, T3 can still provide benefits from hiding the GEMM execution cost as much as possible.…”

Section: Multi-node Setups (mentioning)

confidence: 99%

“…Topology-independent In-switch [36] X X X X ACE [71] X X X X CoCoNet [25] X X Google Decomposition [86] X X X T3-MCA Table 3. Comparing T3-MCA to prior work.…”

Section: No Additional Accelerator (mentioning)

confidence: 99%

“…Although serialized communication scenarios also offer such potential, they require a fine-grained overlap of computation and communication, which presents its own challenges. Enabling their fine-grained overlap in current systems either requires expensive fine-grained synchronization [25] or changes to matrix multiplication (GEMMs) kernels which can be disruptive to GPU software infrastructure [86] (Section 3.1). Furthermore, overlapped compute and communication contend for both compute units and memory bandwidth, reducing overlap's efficacy [25,86] (Section 3.2).…”

Section: Introduction (mentioning)

confidence: 99%

“…Enabling their fine-grained overlap in current systems either requires expensive fine-grained synchronization [25] or changes to matrix multiplication (GEMMs) kernels which can be disruptive to GPU software infrastructure [86] (Section 3.1). Furthermore, overlapped compute and communication contend for both compute units and memory bandwidth, reducing overlap's efficacy [25,86] (Section 3.2). Prior approaches that reduce contention only address coarse-grained overlap of compute and communication in cases like DP and lack support for fine-grained overlap in serialized collectives [71].…”

Section: Introduction (mentioning)

confidence: 99%

T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives

Pati, Aga, Islam et al. 2024

Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems

Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus, hide this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy. To overcome these challenges, we propose T3 which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. T3 transparently fuses producer operations with the subsequent communication via a simple configuration of the producer's output address space and requires minor software changes. At the hardware level, T3 adds a lightweight track and trigger mechanism to orchestrate the producer's compute, and communication. It further uses compute-enhanced memories for communication's attendant compute. As a result, T3 reduces resource contention,
