Overlapping computation and communication: Barrier algorithms and ConnectX-2 CORE-Direct capabilities

Graham, Richard L.; Poole, Steve; Shamis, Pavel; Bloch, Gil; Bloch, Noam; Chapman, H.; Kagan, Michael; Shahar, Ariel; Rabinovitz, Ishai; Shainer, Gilad

doi:10.1109/ipdpsw.2010.5470854

Cited by 28 publications

(12 citation statements)

References 16 publications


“…This management queue allows to delay certain operations until others are finished, and therefore to express dependencies between operations. Researchers reported positive results when implementing single collectives such as barrier and broadcast with this technology [26], [27]. To the best of our knowledge no one has attempted to show that the primitives offered by CORE-Direct are powerful enough to offload any communication schedule, or shown its limits.…”

Section: Experimental Evaluation (mentioning)

confidence: 99%

Protocols for Fully Offloaded Collective Operations on Accelerated Network Adapters

Schneider, Hoefler, Grant, et al. 2013

2013 42nd International Conference on Parallel Processing


Abstract: With each successive generation, network adapters for high-performance networks are becoming more powerful and feature rich. High-performance NICs can now provide support for performing complex group communication operations on the NIC without any host CPU involvement. Several "offloading interfaces" have been designed with the collective communications goal being the complete offloading of arbitrary communication patterns.

In this work, we analyze the offloading model offered in the Portals 4 specification in detail. We perform a theoretical analysis based on abstract communication graphs and show several protocols for implementing offloaded communication schedules. Based on our analysis, we propose and implement an extension to the Portals 4 specification that enables offloading any communication pattern completely to the NIC. Our measurements with several advanced communication algorithms confirm that the enhancements provide good overlap and asynchronous progress in practical settings. Altogether, we demonstrate a complete and simple scheme for implementing arbitrary offloaded communication algorithms and hardware. Our protocols can act as a blueprint for the development of communication hardware as well as middleware while optimizing the whole communication stack.

I. MOTIVATION

Moore's law is still going strong despite the end of frequency and Dennard scaling. CPU and chip vendors have managed to maintain Moore's law by going broad, i.e., duplicate functional units (e.g., cores or vector units) and/or add new functionality to chips. Thus, several microprocessor vendors began extending core functionalities of chips. For example, functionalities that were traditionally in a north bridge, such as a memory controller, are now commonly included in main CPUs. Similarly, networking chips have become more powerful and are to be integrated into next-generation CPUs.

The growing number of cores per network endpoint increases the requirements for the network and memory interfaces. Modern multi-core CPUs already scale the number of memory controllers with the cores, similarly, network interfaces may follow. High-performance networks provide much more complex functionality than memory controllers. Thus, it seems reasonable to devote some silicon to performing advanced functions. The most important parameters of today's networks are latency, bandwidth, and message-rate. Therefore, modern networks are highly tuned to provide high performance for these metrics. However, many algorithms from scientific computing and other fields, such as databases, operating systems, and financial computations, use advanced communication algorithms over sets of processes. These are often called "collective communications" in high-performance computing (HPC) and they are important to many types of applications. Their increased complexity over standard point-to-point communications and participation of multiple processes make them important to overall communication performance, as well as a prime target for efforts to enhance network pe...


“…The ConnectX-2 [5] network interface is the latest adapter from Mellanox. Along with all of the standard InfiniBand features, it offers a new network offloading feature called CORE-Direct [13]. Using this feature, arbitrary lists of send, receive and wait operations can be created.…”

Section: A. InfiniBand and ConnectX-2 Network Interface (mentioning)

confidence: 99%
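
The statement above describes offloading lists of send, receive and wait operations, where a wait delays later entries until an earlier message has arrived. The following is a minimal Python simulation of that idea, not the CORE-Direct or InfiniBand verbs API. It assumes a recursive-doubling barrier purely for illustration (the cited paper's barrier algorithms may differ), and `barrier_task_list` and `run_offloaded` are hypothetical names introduced here.

```python
from collections import deque


def barrier_task_list(rank, nprocs):
    """Build one rank's ordered task list for a recursive-doubling barrier.

    Assumption: nprocs is a power of two. In each round the rank posts a
    send to its partner, then a wait that blocks later tasks until the
    partner's message for that round has arrived.
    """
    tasks = []
    dist, rnd = 1, 0
    while dist < nprocs:
        partner = rank ^ dist
        tasks.append(("send", partner, rnd))
        tasks.append(("wait", partner, rnd))
        dist <<= 1
        rnd += 1
    return tasks


def run_offloaded(nprocs):
    """Simulate every 'NIC' draining its task list with no host involvement.

    Returns the number of sweeps over all ranks needed to finish.
    """
    if nprocs < 1 or nprocs & (nprocs - 1):
        raise ValueError("nprocs must be a power of two")
    queues = [deque(barrier_task_list(r, nprocs)) for r in range(nprocs)]
    arrived = set()  # delivered messages as (source, destination, round)
    sweeps = 0
    while any(queues):
        progressed = False
        for rank, queue in enumerate(queues):
            while queue:
                op, partner, rnd = queue[0]
                if op == "send":
                    arrived.add((rank, partner, rnd))
                elif (partner, rank, rnd) not in arrived:
                    break  # wait not yet satisfied; later tasks stay queued
                queue.popleft()
                progressed = True
        if not progressed:
            raise RuntimeError("task lists deadlocked")
        sweeps += 1
    return sweeps


if __name__ == "__main__":
    print(barrier_task_list(0, 4))
    print("sweeps to complete:", run_offloaded(8))
```

The point of the sketch is that the whole schedule is built up front and then progresses on its own, which is what leaves the host free to compute while the barrier completes.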

Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers

Kandalla, Yang, Keasler, et al. 2012

2012 IEEE 26th International Parallel and Distributed Processing Symposium


Abstract: Scientists across a wide range of domains increasingly rely on computer simulation for their investigations. Such simulations often spend a majority of their run-times solving large systems of linear equations that require vast amounts of computational power and memory. It is hence critical to design solvers in a highly efficient and scalable manner. Hypre is a high performance, scalable software library that offers several optimized linear solver routines and pre-conditioners. In this paper, we study the characteristics of Hypre's Preconditioned Conjugate Gradient (PCG) solver algorithm. The PCG routine is known to spend a majority of its communication time in the MPI_Allreduce operation to compute a global summation during the inner-product operation. The MPI_Allreduce is a blocking operation whose latency is often a limiting factor to the overall efficiency of the PCG solver routine, and correspondingly the performance of simulations that rely on this solver. Hence, hiding the latency of the MPI_Allreduce operation is critical towards scaling the PCG solver routine and improving the performance of many simulations.

The upcoming revision of MPI, MPI-3, will provide support for non-blocking collective communication to enable latency-hiding. The latest InfiniBand adapter from Mellanox, ConnectX-2, enables offloading of generalized lists of communication operations to the network interface. Such an interface can be leveraged to design non-blocking collective operations. In this paper, we design fully functional, scalable algorithms for the MPI_Iallreduce operation, based on the network offload technology. To the best of our knowledge, this is the first such design to be presented in the literature. Our designs scale beyond 512 processes and we achieve near perfect communication/computation overlap. We also re-design the PCG solver routine to leverage our proposed MPI_Iallreduce operation to hide the latency of the global reduction operations. We observe up to 21% improvements in the run-times of the PCG routine, when compared to the default PCG implementation in Hypre. We also note that about 16% of the overall benefits are due to overlapping the Allreduce operations.
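
The latency-hiding pattern this abstract relies on (start a global reduction, do independent work, then wait for the result) can be sketched without MPI. The example below is only an illustration under stated assumptions: a worker thread stands in for the network interface, `simulated_allreduce` is a hypothetical stand-in for a global sum rather than MPI_Iallreduce itself, and `axpy_norm` is placeholder computation that does not depend on the reduction.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def simulated_allreduce(contributions, latency_s=0.01):
    """Stand-in for a global sum; the sleep models network latency."""
    time.sleep(latency_s)
    return sum(contributions)


def blocking_step(contributions, independent_work):
    """Blocking pattern: the host idles until the reduction returns."""
    total = simulated_allreduce(contributions)
    local = independent_work()
    return total, local


def overlapped_step(pool, contributions, independent_work):
    """Non-blocking pattern: post the reduction, compute, then wait."""
    request = pool.submit(simulated_allreduce, contributions)  # post
    local = independent_work()   # work that does not need the sum
    total = request.result()     # wait for completion
    return total, local


def axpy_norm():
    """Placeholder local computation, independent of the reduction."""
    x = [float(i) for i in range(1000)]
    y = [2.0 * v + 1.0 for v in x]
    return sum(v * v for v in y)


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        print(blocking_step([1.0, 2.0, 3.0], axpy_norm))
        print(overlapped_step(pool, [1.0, 2.0, 3.0], axpy_norm))
```

Both variants return the same values; the overlapped form only changes when the host waits, which is why it pays off solely when the solver has work that does not depend on the reduction result.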


“…Researchers have demonstrated the overlap capabilities offered by the ConnectX-2 network interface with MPI_Barrier [3]. In [4], a set of primitives that can be used to design collective operations to leverage the network offload feature were proposed.…”

Section: Designing Non-blocking Algorithms With Collective Offload (mentioning)

confidence: 99%

“…In [3,4,12], researchers have explored various facets of this interface. Using this feature, generic lists of communication tasks can be offloaded to the network interface.…”

Section: Introduction (mentioning)

confidence: 99%

High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT

et al. 2011


Three-dimensional FFT is an important component of many scientific computing applications ranging from fluid dynamics, to astrophysics and molecular dynamics. P3DFFT is a widely used three-dimensional FFT package. It uses the Message Passing Interface (MPI) programming model. The performance and scalability of parallel 3D FFT is limited by the time spent in the Alltoall Personalized exchange (MPI_Alltoall) operations. Hiding the latency of the MPI_Alltoall operation is critical towards scaling P3DFFT. The newest revision of MPI, MPI-3, is widely expected to provide support for non-blocking collective communication to enable latency-hiding. The latest InfiniBand adapter from Mellanox, ConnectX-2, enables offloading of generalized lists of communication operations to the network interface. Such an interface can be leveraged to design non-blocking collective operations. In this paper, we design a scalable, non-blocking Alltoall Personalized Exchange algorithm based on the network offload technology. To the best of our knowledge, this is the first paper to propose high performance non-blocking algorithms for dense collective operations, by leveraging InfiniBand's network offload features. We also re-design the P3DFFT library and a sample application kernel to overlap the Alltoall operations with application-level computation. We are able to scale our implementation of the non-blocking Alltoall operation to more than 512 processes and we achieve near perfect computation/communication overlap (99%). We also see an improvement of about 23% in the overall run-time of our modified P3DFFT when compared to the default blocking version and an improvement of about 17% when compared to the host-based non-blocking Alltoall schemes.

Keywords: Non-blocking collective communication · InfiniBand network offload · 3DFFT · Alltoall personalized exchange · Message passing interface (MPI)
