2011
DOI: 10.1007/s00450-011-0171-3

MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters

Abstract: Data parallel architectures, such as General Purpose Graphics Processing Units (GPGPUs), have seen a tremendous rise in their application for High End Computing. However, data movement in and out of GPGPUs remains the biggest hurdle to overall performance and programmer productivity. Applications executing on a cluster with GPUs have to manage data movement using CUDA in addition to MPI, the de-facto parallel programming standard. Currently, data movement with CUDA and MPI libraries is not integrated and it is not as effic…

Cited by 108 publications (58 citation statements). References 6 publications.
“…To bridge the gap between the disjointed MPI and GPU programming models, researchers have recently developed GPU-integrated MPI solutions such as our MPI-ACC [6] framework and MVAPICH-GPU [28] by Wang et al. These frameworks provide a unified MPI data transmission interface for both host and GPU memories; in other words, the programmer can use either the CPU buffer or the GPU buffer directly as the communication parameter in MPI routines. The goal of such GPU-integrated MPI platforms is to decouple the complex, low-level, GPU-specific data movement optimizations from the application logic, thus providing the following benefits: (1) portability: the application can be more portable across multiple accelerator platforms; and (2) forward compatibility: with the same code, the application can automatically achieve performance improvements from new GPU technologies (e.g., GPUDirect RDMA) if applicable and supported by the MPI implementation.…”
Section: Application Design Using GPU-Integrated MPI Framework
confidence: 99%
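
To make the unified interface described in the quote above concrete, here is a minimal point-to-point sketch assuming a CUDA-aware MPI build (for example, MVAPICH2 compiled with CUDA support); the buffer name and message size are illustrative and not taken from the paper. The device pointer is handed directly to MPI_Send/MPI_Recv, and any staging or RDMA happens inside the library.

/* Minimal sketch: point-to-point exchange with a GPU-integrated
 * (CUDA-aware) MPI. Device memory is passed straight to MPI routines. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                      /* 1M doubles, illustrative */
    double *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(double));

    if (rank == 0) {
        /* ... a kernel would fill d_buf here ... */
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* ... a kernel would consume d_buf here ... */
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

The same code also works when d_buf is a host pointer, which is exactly the portability benefit the quoted passage highlights.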
“…The cudaMPI library provides wrapper API functions that combine CUDA and MPI data movement [21]. Similarly to MPI-ACC, Wang et al. propose to add CUDA [2] support to MVAPICH2 [22] and optimize the internode communication for InfiniBand networks [28]. All-to-all communication [27] and noncontiguous datatype communication [17,29] have also been studied in the context of GPU-aware MPI.…”
Section: Related Work
confidence: 99%
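
As an illustration of the noncontiguous-datatype case mentioned above [17,29], the following sketch sends one strided column of a row-major matrix held in device memory through an MPI derived datatype, assuming a GPU-aware MPI; the function and variable names are hypothetical.

/* Illustrative sketch: a noncontiguous (strided) region of a GPU-resident
 * matrix described with a derived datatype. Whether the library pipelines
 * or packs such datatypes internally is implementation-dependent. */
#include <mpi.h>

void send_device_column(double *d_matrix, int rows, int cols,
                        int col, int dest, MPI_Comm comm)
{
    MPI_Datatype column;
    /* One element per row, stride of 'cols' elements (row-major layout). */
    MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* The device pointer plus the datatype describe the noncontiguous
     * region directly; no manual packing on the host is needed. */
    MPI_Send(d_matrix + col, 1, column, dest, 0, comm);

    MPI_Type_free(&column);
}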
“…In this simple example, the only purpose of the host_buf buffer is to facilitate MPI communication of data stored in device memory. As the number of accelerators (and hence distinct memories) per node increases, manual data movement poses significant productivity problems [5].…”
Section: Challenges in CUDA+MPI Programming
confidence: 99%
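
The pattern the quoted passage refers to can be sketched as follows, assuming a plain (non-GPU-aware) MPI, so every transfer must be staged through host memory by hand; host_buf and d_buf are illustrative names.

/* Sketch of manual CUDA+MPI staging: the host buffer exists only to
 * carry data between device memory and the MPI library. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

void send_from_device(double *d_buf, int n, int dest, MPI_Comm comm)
{
    double *host_buf = (double *)malloc(n * sizeof(double));
    cudaMemcpy(host_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Send(host_buf, n, MPI_DOUBLE, dest, 0, comm);
    free(host_buf);
}

void recv_to_device(double *d_buf, int n, int src, MPI_Comm comm)
{
    double *host_buf = (double *)malloc(n * sizeof(double));
    MPI_Recv(host_buf, n, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);
    cudaMemcpy(d_buf, host_buf, n * sizeof(double), cudaMemcpyHostToDevice);
    free(host_buf);
}

With a GPU-integrated MPI, both helpers collapse to a single MPI call on d_buf, which is the productivity argument made in the citing paper.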
“…This work has focused on MPI point-to-point communication for internode GPU communication [5], all-to-all communication [14], and noncontiguous-type communication [15]. Similar work has also been proposed in the context of OpenMPI [16].…”
Section: Related Work
confidence: 99%
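
For the all-to-all case cited above [14], a GPU-integrated MPI lets the collective be invoked directly on device buffers. The sketch below assumes such a library; buffer names and counts are illustrative.

/* Hedged sketch: MPI_Alltoall called directly on device buffers,
 * as a GPU-aware MPI permits. */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_blocks(int per_rank_count, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    double *d_send, *d_recv;
    size_t bytes = (size_t)nprocs * per_rank_count * sizeof(double);
    cudaMalloc((void **)&d_send, bytes);
    cudaMalloc((void **)&d_recv, bytes);

    /* ... kernels would fill d_send here ... */

    MPI_Alltoall(d_send, per_rank_count, MPI_DOUBLE,
                 d_recv, per_rank_count, MPI_DOUBLE, comm);

    cudaFree(d_send);
    cudaFree(d_recv);
}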