Designing an Efficient Kernel-Level and User-Level Hybrid Approach for MPI Intra-Node Communication on Multi-Core Systems

Chai, Lei; Lai, Ping; Jin, Hyun-Wook; Panda, Dhabaleswar K.

doi:10.1109/icpp.2008.16

Cited by 20 publications

(19 citation statements)

References 9 publications

“…For this reason, all IMB tests in the rest of the article will be presented with offcache enabled, assuming it better represents the performance that real applications may expect. While this methodology shows lower throughput than [3], [4], it fortunately brings comparable behaviors, especially regarding the threshold that determines when to switch from NEMESIS to KNEM: KNEM becomes interesting once the message size passes 16 KiB. It is also worth noticing here that I/OAT copy offload brings interesting performance improvements (up to 80%) as soon as KNEM is used.…”

Section: B Impact Of Cache Sharing (mentioning)

confidence: 95%

“…It also pollutes the caches by evicting application data from it as the copy operation is being performed [8]. In the end, this strategy shows very interesting latency for small messages but it is not recommended for large messages [3], [4].…”

Section: B Traditional Double-copy Implementation (mentioning)

confidence: 99%
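
The double-copy strategy described in the statement above can be sketched as follows. This is a minimal single-process illustration, not code from any MPI implementation: the function name `double_copy` and the `staging` buffer (standing in for the shared-memory region between two processes) are hypothetical. It shows why every payload byte is moved twice, which is the source of the extra CPU work and cache pollution the statement refers to.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: moves len bytes from src to dst through a bounded
 * staging buffer, the way a sender and receiver would go through a shared
 * memory region. Returns the number of memcpy calls performed. */
static size_t double_copy(void *dst, const void *src, size_t len,
                          void *staging, size_t staging_len)
{
    assert(staging_len > 0);
    size_t copies = 0;
    size_t off = 0;
    while (off < len) {
        size_t remaining = len - off;
        size_t n = remaining < staging_len ? remaining : staging_len;
        /* Copy 1: the sender writes a chunk into the shared buffer. */
        memcpy(staging, (const char *)src + off, n);
        /* Copy 2: the receiver reads the chunk out of the shared buffer. */
        memcpy((char *)dst + off, staging, n);
        copies += 2;
        off += n;
    }
    return copies;
}
```

A kernel-assisted single-copy transfer would replace the two `memcpy` calls per chunk with one copy directly from the sender's buffer to the receiver's buffer, at the cost of a system call.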

“…The aggregated performance is thus limited by the memory bus and not by the actual copy implementation. However, since the threshold does not seem to vary with the number of processes, we assume that no memory bus saturation occurs for medium messages (up to 1 MiB).…” [Footnote 3 of the citing paper: Computed using the total amount of data transferred for each collective operation.]

Section: Collective Operations (mentioning)

confidence: 99%

“…Previous work [3], [4] introduced operating system assistance as a way to improve large message throughput. We present an in-depth study of this solution in the context of complex shared-memory machines.…”

Section: Introduction (mentioning)

confidence: 99%

Optimizing MPI communication within large multicore nodes with kernel assistance

Moreaud, Goglin, Namyst, et al. 2010

2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW)

As the number of cores per node increases in modern clusters, intra-node communication efficiency becomes critical to application performance. We present a study of the traditional double-copy model in MPICH2 and a kernel-assisted single-copy strategy with KNEM on different shared-memory hosts with up to 96 cores. We show that KNEM suffers less from process placement on these complex architectures. It improves throughput up to a factor of 2 for large messages for both point-to-point and collective operations, and significantly improves NPB execution time. We detail when to switch from one strategy to the other depending on the communication pattern and we show that I/OAT copy offload only appears to be an interesting solution for older architectures.
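
The switch between the two strategies can be pictured as a simple size test. The sketch below is an assumption-laden illustration, not the MPICH2 or KNEM code: the enum, the macro, and `choose_strategy` are hypothetical names, the 16 KiB default is taken from the citation statement quoted earlier on this page, and treating a message of exactly the threshold size as "large" is an arbitrary choice made here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum copy_strategy {
    COPY_DOUBLE_SHM,    /* two copies through a shared-memory buffer */
    COPY_SINGLE_KERNEL  /* one kernel-assisted copy */
};

/* 16 KiB is the switch point reported in the quoted statement for one
 * test setup; it is not a universal constant. */
#define DEFAULT_SWITCH_THRESHOLD ((size_t)16 * 1024)

/* Picks a strategy from the message size alone. The >= boundary is an
 * assumption; the source only says the size must "pass" the threshold. */
static enum copy_strategy choose_strategy(size_t msg_bytes, size_t threshold)
{
    assert(threshold > 0);
    return msg_bytes >= threshold ? COPY_SINGLE_KERNEL : COPY_DOUBLE_SHM;
}
```

Keeping the threshold a parameter rather than a constant reflects the abstract's point that the right switch point depends on the communication pattern and the machine.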

“…However, it does not support I/OAT copy offload, vectorial buffers, or asynchronous data transfer. It has been used within MVAPICH2 with configurable thresholds for switching from the usual two-copies to the kernel-based, single-copy model [7]. However, it does not provide any automatic threshold, whereas our KNEM LMT dynamically computes its thresholds depending on the hardware characteristics.…”

Section: Related Work (mentioning)

confidence: 99%

Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis

Buntinas, Goglin, Goodell, et al. 2009

2009 International Conference on Parallel Processing

The emergence of multicore processors raises the need to efficiently transfer large amounts of data between local processes. MPICH2 is a highly portable MPI implementation whose large-message communication schemes suffer from high CPU utilization and cache pollution because of the use of a double-buffering strategy, common to many MPI implementations. We introduce two strategies offering a kernel-assisted, single-copy model with support for noncontiguous and asynchronous transfers. The first one uses the now widely available vmsplice Linux system call; the second one further improves performance thanks to a custom kernel module called KNEM. The latter also offers I/OAT copy offload, which is dynamically enabled depending on both hardware cache characteristics and message size. These new solutions outperform the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores or when very large messages are being transferred. Collective communication operations show a dramatic improvement, and the IS NAS parallel benchmark shows a 25% speedup and better cache efficiency.

Designing truly one-sided MPI-2 RMA intra-node communication on multi-core systems

2010

Self Cite

The increasing popularity of multi-core processors has made MPI intra-node communication, including intra-node RMA (Remote Memory Access) communication, a critical component in high performance computing. The MPI-2 RMA model includes one-sided data transfer and synchronization operations. Existing designs in popularly used MPI stacks do not provide truly one-sided intra-node RMA communication. They are built on top of two-sided send-receive operations, and therefore suffer from the overheads of two-sided communication and from dependency on the remote side. In this paper, we enhance existing shared memory mechanisms to design truly one-sided synchronization. In addition, we design truly one-sided intra-node data transfer using two kernel-based direct copy alternatives: a basic kernel-assisted approach and an I/OAT-assisted approach. Our new design eliminates the overhead of using two-sided operations and eliminates the involvement of the remote side. We also propose a series of benchmarks to evaluate various performance aspects over multi-core architectures (Intel Clovertown, Intel Nehalem and AMD Barcelona). The results show that the new design obtains up to 39% lower latency for small and medium messages and demonstrates a 29% improvement in large message bandwidth. Moreover, it provides superior performance in terms of better scalability, reduced cache misses, higher resilience to process skew and increased computation and communication overlap. Finally, a performance benefit of up to 10% is demonstrated for a real scientific application, AWM-Olsen.
