2023
DOI: 10.1007/s11390-023-2894-6
xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning

Cited by 10 publications
(2 citation statements)
References 68 publications
“…Section 8 of the paper discusses the related research papers, emphasizing the differences between those papers and the current study. One of the papers examined in this section analyzes the latency of various communication libraries in both inter-node and intra-node environments [25]. It covers the collective communication functions commonly used in distributed deep learning and investigates each library's performance in detail.…”
Section: Related Work
confidence: 99%
“…The Allreduce operator is widely used in scientific computing and artificial intelligence. It is one of the basic operators of parallel computing and the most important collective communication operation in distributed deep learning. Realizing highly efficient, scalable, and reliable Allreduce collective communication is therefore important for improving the performance of computation-intensive applications such as distributed training [2].…”
Section: Introduction
confidence: 99%
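The ring algorithm is the classic bandwidth-optimal way libraries such as NCCL implement Allreduce: a reduce-scatter phase followed by an allgather phase, each taking p−1 steps for p ranks. The sketch below is a hypothetical pure-Python simulation of that schedule (all function names are illustrative; it does not reflect any actual xCCL API):

```python
def ring_allreduce(data):
    """Simulated ring Allreduce (sum).

    data[r] is rank r's input vector; each vector is split into p equal
    chunks, where p = number of ranks. Returns the per-rank buffers,
    which all end up holding the element-wise sum across ranks.
    """
    p = len(data)                      # number of ranks in the ring
    n = len(data[0])
    assert n % p == 0, "vector length must divide evenly into p chunks"
    chunk = n // p
    buf = [list(v) for v in data]      # each rank's private working buffer

    def get(r, c):
        return buf[r][c * chunk:(c + 1) * chunk]

    def put(r, c, vals):
        buf[r][c * chunk:(c + 1) * chunk] = vals

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) % p
    # to its ring neighbor (r + 1) % p, which adds it into its own copy.
    # After p - 1 steps, rank r holds the fully reduced chunk (r + 1) % p.
    for s in range(p - 1):
        # Snapshot all sends first to model the simultaneous exchange.
        sends = [(r, (r - s) % p, get(r, (r - s) % p)) for r in range(p)]
        for r, c, vals in sends:
            dst = (r + 1) % p
            put(dst, c, [a + b for a, b in zip(get(dst, c), vals)])

    # Phase 2: allgather. Each rank circulates its completed chunk around
    # the ring; receivers overwrite rather than add.
    for s in range(p - 1):
        sends = [(r, (r + 1 - s) % p, get(r, (r + 1 - s) % p)) for r in range(p)]
        for r, c, vals in sends:
            put((r + 1) % p, c, vals)

    return buf


# Two simulated ranks, vectors [1, 2] and [3, 4]: every rank ends with
# the element-wise sum [4, 6].
print(ring_allreduce([[1, 2], [3, 4]]))
```

Each rank sends 2(p−1)/p of the data volume in total, which is why the ring schedule is preferred for large tensors despite its latency growing linearly in p.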