Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2022
DOI: 10.1145/3567955.3567959
Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models

Cited by 12 publications (7 citation statements)
References 13 publications
“…For very large models (e.g., PALM, MT-NLG) we consider 32-way slicing, and for futuristic ones with one and ten trillion parameters, we consider 64-way sharding. The increasing TP slicing is necessary because these models' larger sizes cannot fit in 16 GPUs [64] and the increased slicing is also enabled by nodes with larger device counts [59,86]. Like prior work [36,49,64], we find that communication is a considerable fraction of the overall runtime: Megatron-GPT-2 (Mega-GPT-2) and T-NLG spend up to 34% and 43% of their training and inference (prompt phase) time on communication.…”
Section: All-reduce Is On the Critical Path And Can Be Large (supporting)
confidence: 50%
“…However, T3 can also be applied to serialized communication in inter-node setups with slower and often heterogeneous links. Consequently, communication costs can be much larger than GEMM executions, potentially limiting the benefits from fine-grained overlap: once the computation is completely overlapped, the remaining communication costs will be exposed [86]. Nevertheless, T3 can still provide benefits from hiding the GEMM execution cost as much as possible.…”
Section: Multi-node Setups (mentioning)
confidence: 99%
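The excerpt above refers to fine-grained overlap of a GEMM with its dependent all-reduce. As an illustrative sketch only (not the paper's implementation), the decomposition idea can be shown with NumPy by simulating a row-parallel linear layer: each "device" holds a column shard of the input and a row shard of the weight, and the output is computed chunk by chunk so that each chunk's all-reduce could, in a real system, proceed asynchronously while the next chunk's GEMM runs. The function name and the sequential schedule are assumptions for clarity; a real implementation would use asynchronous collectives (e.g., in torch.distributed or NCCL).

```python
import numpy as np

def row_parallel_matmul_overlapped(x_shards, w_shards, n_chunks=4):
    """Sketch of communication/computation overlap via decomposition.

    Row-parallel linear layer: device d holds x[:, d*k:(d+1)*k] and
    w[d*k:(d+1)*k, :], so summing the per-device partial products (the
    all-reduce) yields the full x @ w. Splitting the batch dimension
    into chunks lets the all-reduce of one chunk overlap the GEMM of
    the next; here the schedule is sequential for clarity.
    """
    n = x_shards[0].shape[0]
    out = np.empty((n, w_shards[0].shape[1]))
    for rows in np.array_split(np.arange(n), n_chunks):
        # Per-device partial GEMMs for this chunk.
        partials = [x[rows] @ w for x, w in zip(x_shards, w_shards)]
        # "All-reduce" of this chunk: sum partials across devices.
        out[rows] = np.sum(partials, axis=0)
    return out

# Usage: shard a matmul across 3 simulated devices and check the result.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6))
w = rng.standard_normal((6, 5))
x_shards = np.split(x, 3, axis=1)   # column shards of x
w_shards = np.split(w, 3, axis=0)   # row shards of w
out = row_parallel_matmul_overlapped(x_shards, w_shards, n_chunks=4)
```

Chunking does not change the result: the chunked, reduced output matches the monolithic `x @ w`, which is what makes the overlapped schedule safe.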
See 3 more Smart Citations