A Bandwidth-Saving Optimization for MPI Broadcast Collective Operation

Zhou, Huan; Marjanovic, Vladimir; Niethammer, Christoph; Gracia, José

doi:10.1109/icppw.2015.20

Cited by 7 publications

(3 citation statements)

References 5 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Broadcast algorithms have been well-studied in the literature over the decades [8], [33], [36]. However, the emergence of accelerators such as GPUs has significantly changed this field of research.…”

Section: Performance Modeling Of Existing Broadcast Algorithmsmentioning

confidence: 99%

See 1 more Smart Citation

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters

Awan

Chu

Subramoni

et al. 2018

Proceedings of the 25th European MPI Users' Group Meeting

View full text Add to dashboard Cite

Dense Multi-GPU systems have recently gained a lot of attention in the HPC arena. Traditionally, MPI runtimes have been primarily designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important to address efficient communication schemes for such dense Multi-GPU nodes. This coupled with new application workloads brought forward by Deep Learning frameworks like Caffe and Microsoft CNTK pose additional design constraints due to very large message communication of GPU buffers during the training phase. In this context, specialpurpose libraries like NVIDIA NCCL have been proposed for GPU-based collective communication on dense GPU systems. In this paper, we propose a pipelined chain (ring) design for the MPI Bcast collective operation along with an enhanced collective tuning framework in MVAPICH2-GDR that enables efficient intra-/inter-node multi-GPU communication. We present an in-depth performance landscape for the proposed MPI Bcast schemes along with a comparative analysis of NVIDIA NCCL Broadcast and NCCL-based MPI Bcast. The proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement, compared to NCCL-based solutions, for intra-and inter-node broadcast latency, respectively. In addition, the proposed designs provide up to 7% improvement over NCCL-based solutions for data parallel training of the VGG network on 128 GPUs using Microsoft CNTK.

show abstract

Section: Performance Modeling Of Existing Broadcast Algorithmsmentioning

confidence: 99%

“…Zhou et. al proposed an optimized broadcast for large message sizes in [36]. A slightly orthogonal effort to optimize intra-node communication by exploiting shared-memory programming and MPI one-sided features was presented in [13].…”

Section: Related Workmentioning

confidence: 99%

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters

Awan

Chu

Subramoni

et al. 2018

Proceedings of the 25th European MPI Users' Group Meeting

View full text Add to dashboard Cite

show abstract

“…Additionally, MPI researchers conduct long-term optimization works on the MPI RMA [14]- [17] and collective operations [18]- [21]. Those optimizations are also liable to be applied onto DART-MPI and then benefit the performance of applications.…”

Section: Related Workmentioning

confidence: 99%

Towards Performance Portability through Locality-Awareness for Applications Using One-Sided Communication Primitives

Zhou

Gracia

2016

2016 Fourth International Symposium on Computing and Networking (CANDAR)

Self Cite

View full text Add to dashboard Cite

Abstract-MPI is the most widely used data transfer and communication model in High Performance Computing. The latest version of the standard, MPI-3, allows skilled programmers to exploit all hardware capabilities of the latest and future supercomputing systems. The revised asynchronous remote-memory-access model in combination with the sharedmemory window extension, in particular, allow writing code that hides communication latencies and optimizes communication paths according to the locality of data origin and destination. The latter is particularly important for today's multi-and many-core systems. However, writing such efficient code is highly complex and error-prone. In this paper we evaluate a recent remote-memory-access model, namely DART-MPI. This model claims to hide the aforementioned complexities from the programmer, but deliver locality-aware remote-memory-access semantics which outperforms MPI-3 one-sided communication primitives on multi-core systems. Conceptually, the DART-MPI interface is simple; at the same time it takes care of the complexities of the underlying MPI-3 and system topology. This makes DART-MPI an interesting candidate for porting legacy applications. We evaluate these claims using a realistic scientific application, specifically a finite-difference stencil code which solves the heat diffusion equation, on a large-scale Cray XC40 installation.

show abstract

Sparbit: Towards to a Logarithmic-Cost and Data Locality-Aware MPI Allgather Algorithm

Loch

Koslovski

2023

J Grid Computing

View full text Add to dashboard Cite

A Bandwidth-Saving Optimization for MPI Broadcast Collective Operation

Cited by 7 publications

References 5 publications

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters

Towards Performance Portability through Locality-Awareness for Applications Using One-Sided Communication Primitives

Sparbit: Towards to a Logarithmic-Cost and Data Locality-Aware MPI Allgather Algorithm

Contact Info

Product

Resources

About