Protocols for Fully Offloaded Collective Operations on Accelerated Network Adapters

Schneider, Timo; Hoefler, Torsten; Grant, Ryan E.; Barrett, Brian; Brightwell, Ron

doi:10.1109/icpp.2013.73

Cited by 18 publications

(7 citation statements)

References 26 publications

“…III. Schneider et al [22] discuss about protocols for fully offloaded collectives, however, their protocol requires synchronization among the involved nodes. Barrett et al [8] propose an offloaded version of the rendezvous protocol based on Portals 4 triggered operations, requiring CPU intervention in the unexpected message case.…”

Section: B Simulations (mentioning)

confidence: 99%

“…In the sender-initiated version, a control message is sent to the receiver that will reply when the matching receive will be posted (and thus the receiver buffer will be ready). In the receiver-initiated version [22], the receiver has to signal to the sender when it is able to receive the message. Without loss of generality, in this work we consider only the sender-initiated variant of this protocol, since the receiver-initiated one can be implemented similarly.…”

Section: Eager Protocol (mentioning)

confidence: 99%
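The sender-initiated handshake described in the statement above can be illustrated with a minimal, single-process sketch. It is an assumption-laden toy model, not the protocol implementation from either paper: the class names `Sender` and `Receiver`, the method names, and the use of integer tags to match sends with receives are all hypothetical and chosen only for illustration. The sender first issues a control message, the receiver replies once the matching receive has been posted, and only then is the payload transferred.

```python
class Receiver:
    """Toy receiver: replies to a control message only when a buffer is posted."""

    def __init__(self):
        self.pending = {}  # tag -> sender still waiting for a "ready" reply
        self.posted = {}   # tag -> receive buffer posted by the application

    def on_control(self, sender, tag):
        # Control message arrived. Reply at once if the receive is already
        # posted; otherwise remember the sender until it is.
        if tag in self.posted:
            sender.on_ready(self, tag)
        else:
            self.pending[tag] = sender

    def post_receive(self, tag, buffer):
        # The application posts the matching receive, so the buffer is ready.
        self.posted[tag] = buffer
        sender = self.pending.pop(tag, None)
        if sender is not None:
            sender.on_ready(self, tag)

    def on_data(self, tag, payload):
        # Payload lands directly in the posted buffer.
        self.posted.pop(tag).extend(payload)


class Sender:
    """Toy sender: holds the payload until the receiver signals readiness."""

    def __init__(self):
        self.outgoing = {}  # tag -> payload waiting for the "ready" reply
        self.log = []       # order of messages put on the (simulated) wire

    def send(self, receiver, tag, payload):
        self.outgoing[tag] = payload
        self.log.append(("control", tag))
        receiver.on_control(self, tag)

    def on_ready(self, receiver, tag):
        self.log.append(("data", tag))
        receiver.on_data(tag, self.outgoing.pop(tag))


# Usage: the send is issued before the receive is posted, so the data
# transfer is deferred until post_receive() is called.
sender, receiver, buf = Sender(), Receiver(), []
sender.send(receiver, tag=7, payload=[1, 2, 3])
receiver.post_receive(7, buf)
```

In this sketch the receiver-initiated variant would invert the first step: the receiver would announce its posted buffer and the sender would react to that announcement, which is consistent with the statement's remark that the two variants can be implemented similarly.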

Exploiting Offload Enabled Network Interfaces

Girolamo, Jolivet, Underwood, et al. 2015

2015 IEEE 23rd Annual Symposium on High-Performance Interconnects

Self Cite

Network interface cards are one of the key components to achieve efficient parallel performance. In the past, they have gained new functionalities such as lossless transmission and remote direct memory access that are now ubiquitous in high-performance systems. Prototypes of next generation network cards now offer new features that facilitate device programming. In this work, various possible uses of network offload features are explored. We use the Portals 4 interface specification as an example to demonstrate various techniques such as fully asynchronous, multi-schedule asynchronous, and solo collective communications. MPI collectives are used as a proof of concept for how to leverage our proposed semantics. In a solo collective, one or more processes can participate in a collective communication without being aware of it. This semantic enables fully asynchronous algorithms. We discuss how the application of the solo collectives can improve the performance of iterative methods, such as multigrid solvers. The results obtained show how this work may be used to accelerate existing MPI applications, but they also display how these techniques could ease the programming of algorithms outside of the Bulk Synchronous Parallel (BSP) model.

“…A number of convenience constructs, such as parallel threaded loops and reduction operations are also provided. The remote operation is built on top of Portals4 library [14]. Qthreads execute on POSIX-compliant machines and have been tested on Linux, Solaris, and Mac OS using GNU, Intel, PGI, and Tilera compilers.…”

Section: Qthreads (mentioning)

confidence: 99%

A Survey: Runtime Software Systems for High Performance Computing

2017

JSFI

High Performance Computing system design and operation are challenged by requirements for significant advances in efficiency, scalability, productivity, and portability at the end of Moore's Law with approaching nano-scale technology. Conventional practices employ message-passing programming interfaces, sometimes combined with thread-based shared memory interfaces such as OpenMP. These methods are principally coarse grained and statically scheduled. Yet many real-world applications achieve efficiencies of less than 10%, even though some benchmarks achieve 80% efficiency or better (e.g., HPL). To address these challenges, strategies employing runtime software systems are being pursued to exploit information about the status of the application and the system hardware operation throughout the execution to guide task scheduling and resource management for dynamic adaptive control. Runtimes provide adaptive means to reduce the effects of starvation, latency, overhead, and contention. Many share common properties such as multi-tasking (either preemptive or non-preemptive), message-driven computation such as active messages, sophisticated fine-grain synchronization such as dataflow and future constructs, global name or address spaces, and control policies for optimizing task scheduling to address the uncertainty of asynchrony. This survey will identify key parameters and properties of modern and experimental runtime systems actively employed today and provide a detailed description, summary, and comparison within a shared space of dimensions. It is not the intent of this paper to determine which is better or worse but rather to provide sufficient detail to permit the reader to select among them according to individual need.

“…Several APIs have been proposed for offloading collective operation management to the HCA. This includes the Mellanox's CORE-Direct [13], protocol, Portal 4.0 triggered operations [7], and an extension to Portals 4.0 [29]. All these support protocols that use end-point management of the collective operations, whereas in the current approach the end-points are involved only in collective initiation and completion, with the switching infrastructure supporting the collective operation management.…”

Section: Previous Work (mentioning)

confidence: 99%

Towards A Data Centric System Architecture: SHARP

Graham, Bloch, Bureddy, et al. 2017

JSFI

Increased system size and a greater reliance on utilizing system parallelism to achieve computational needs require innovative system architectures to meet the simulation challenges. The SHARP technology is a step towards a data-centric architecture, where data is manipulated throughout the system. This paper introduces a new SHARP optimization, and studies aspects that impact application performance in a data-centric environment. The use of UD-Multicast to distribute aggregation results is introduced, reducing the latency of an eight-byte MPI Allreduce() across 128 nodes by 16%. Use of reduction trees that avoid the inter-socket bus further improves the eight-byte MPI Allreduce() latency across 128 nodes, with 28 processes per node, by 18%. The distribution of latency across processes in the communicator is studied, as is the capacity of the system to process concurrent aggregation operations.
