In distributed deep learning, improper use of a collective communication library can degrade training performance by increasing communication time. Representative collective communication libraries such as MPI, GLOO, and NCCL exhibit varying performance depending on the server environment and the communication architecture. In this study, we investigate three key aspects to evaluate the performance of collective communication libraries for distributed deep learning in an intra-node environment. First, we compare and analyze collective communication library performance within common distributed deep learning architectures, such as the parameter server and ring all-reduce methods. Second, we evaluate the performance of these libraries in different environments, including various container platforms and bare-metal setups, considering the scalability and flexibility advantages offered by cloud virtualization. Last, to ensure practicality, we assess the libraries’ performance both from a Linux shell and within the PyTorch framework. In the cross-docker virtualization environment, NCCL shows up to 213% higher latency than in the single-docker environment, GLOO exhibits 36% lower latency in the single-docker environment than in the cross-docker environment, and NCCL completes all-reduce operations up to 345% faster than the other libraries (MPI and GLOO). These findings can inform the selection of an appropriate collective communication library when designing efficient distributed deep learning environments.
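For context, the following is a minimal sketch (not taken from the paper) of how such an intra-node all-reduce comparison can be set up with torch.distributed, where the backend string ("gloo", "nccl", or "mpi") selects the communication library; the tensor size, port, and world size below are illustrative assumptions.

```python
import os
import time
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank: int, world_size: int, backend: str) -> None:
    # Each process joins the same process group through the chosen backend.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    # NCCL operates on GPU tensors; GLOO (and MPI) also handle CPU tensors.
    device = torch.device(f"cuda:{rank}") if backend == "nccl" else torch.device("cpu")
    tensor = torch.ones(16 * 1024 * 1024, device=device)  # ~64 MB of float32 (assumed size)

    # Warm-up all-reduce, then time a second one (sum across all ranks).
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    if rank == 0:
        print(f"{backend}: all_reduce of {tensor.numel()} floats took {elapsed * 1e3:.2f} ms")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2    # number of processes on one node (assumption)
    backend = "gloo"  # swap to "nccl" (or "mpi", if PyTorch was built with it) to compare
    mp.spawn(run, args=(world_size, backend), nprocs=world_size)
```

The same pattern can be launched inside a single container or across containers to reproduce the kind of single-docker versus cross-docker comparison described above; only the process launcher and network configuration change, not the collective call itself.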