Efficient asynchronous memory copy operations on multi-core systems and I/OAT

Vaidyanathan, Karthikeyan; Chai, Lei; Huang, Wei; Panda, D.K.

doi:10.1109/clustr.2007.4629228

Cited by 22 publications

(13 citation statements)

References 7 publications

“…User-level memory copy offload with the I/OAT DMA Engine has been studied in a single application [16]. Its comparison with offloading a regular memcpy in a thread revealed the same conclusion as ours: I/OAT becomes interesting for megabyte and larger messages [17].…”

Section: Related Work (mentioning)

confidence: 56%

Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis

Buntinas, Goglin, Goodell, et al. 2009

2009 International Conference on Parallel Processing

The emergence of multicore processors raises the need to efficiently transfer large amounts of data between local processes. MPICH2 is a highly portable MPI implementation whose large-message communication schemes suffer from high CPU utilization and cache pollution because of the use of a double-buffering strategy, common to many MPI implementations. We introduce two strategies offering a kernel-assisted, single-copy model with support for noncontiguous and asynchronous transfers. The first one uses the now widely available vmsplice Linux system call; the second one further improves performance thanks to a custom kernel module called KNEM. The latter also offers I/OAT copy offload, which is dynamically enabled depending on both hardware cache characteristics and message size. These new solutions outperform the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores or when very large messages are being transferred. Collective communication operations show a dramatic improvement, and the IS NAS parallel benchmark shows a 25% speedup and better cache efficiency.

“…In the context of high-performance computing, I/OAT improves inter-process communication such as shared-memory MPI implementations on multicore nodes [21]. Our I/OAT based local communication model is very similar but has the advantage of being transparently integrated into the OPEN-MX stack since the driver automatically switches from regular to local communication without needing any specific support in user-space.…”

Section: Discussion and Related Work (mentioning)

confidence: 99%

“…Indeed, OPEN-MX local communication is based on a system call where a direct copy is performed between the source process address space into the target. A comparable model has been presented in [21] as an extension to the MVAPICH MPI middleware. This design is actually nicely integrated into the OPEN-MX stack since all communications, either local or through the network, are managed by the driver through the same commands, and they return the same events to the userspace library.…”

Section: Offloading Synchronous Copies (mentioning)

confidence: 99%

Improving message passing over Ethernet with I/OAT copy offload in Open-MX

Goglin 2008

2008 IEEE International Conference on Cluster Computing

To cite this version: Brice Goglin. Improving Message Passing over Ethernet with I/OAT Copy Offload in Open-MX. IEEE. Cluster 2008, Sep 2008, Tsukuba, Japan. 2008.

Abstract: Open-MX is a new message passing layer implemented on top of the generic Ethernet stack of the Linux kernel. Open-MX works on all Ethernet hardware, but it suffers from expensive memory copy requirements on the receiver side due to the hardware's inability to deposit messages directly in the target application buffers. This article presents the implementation of an asynchronous memory copy offload in the Open-MX stack thanks to Intel I/O Acceleration Technology. The overlapping of large message fragment copies with the processing increases the receive throughput by 30% while reducing the CPU usage by up to 40%. It enables Open-MX to reach 10 gigabit/s Ethernet line rate for large messages. Open-MX large intra-node communication also benefits significantly from the I/OAT hardware since the performance of its one-copy-based local communication mechanism is almost doubled by using blocking I/OAT memory copies. By combining all these optimizations, the Open-MX large message performance on top of 10G hardware is now able to bridge the gap with the native Myrinet Express stack.

“…Many works have recently considered the more general issue of copying memory regions in multicore systems using specific hardware [41,90], or how the memory management can play a significant role in the communication performance [40,84]. However, the interactions between simultaneously transferring the data to the Network Interface Card and obtaining an additional copy in the application space has not been addressed.…”

Section: Optimizing Sender-based Message Logging (mentioning)

confidence: 98%

Fault-Tolerant MPI

Bouteiller 2015

Computer Communications and Networks

As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI standard remains distressingly vague on the consequence of failures on MPI communications. In this chapter, we present the spectrum of techniques that can be applied to enable MPI application recovery, ranging from fully automatic to completely user driven. First, we present the effective deployment of most advanced checkpoint/restart techniques within the MPI implementation, so that failed processors are automatically restarted in a consistent state with surviving processes, at a performance cost. Then, we investigate how MPI can support application-driven recovery techniques, and introduce a set of extensions to MPI that allow restoring communication capabilities, while maintaining the extreme level of performance to which MPI users have become accustomed.
