Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis

Buntinas, Darius; Goglin, Brice; Goodell, David; Mercier, Guillaume; Moreaud, Stéphanie

doi:10.1109/icpp.2009.22

Cited by 50 publications

(58 citation statements)

References 12 publications


“…It also pollutes the caches by evicting application data from it as the copy operation is being performed [8]. In the end, this strategy shows very interesting latency for small messages but it is not recommended for large messages [3], [4].…”

Section: B Traditional Double-copy Implementation (mentioning)

confidence: 99%


Optimizing MPI communication within large multicore nodes with kernel assistance

Moreaud, Goglin, Namyst, et al. 2010

2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW)

Self Cite


Abstract. As the number of cores per node increases in modern clusters, intra-node communication efficiency becomes critical to application performance. We present a study of the traditional double-copy model in MPICH2 and a kernel-assisted single-copy strategy with KNEM on different shared-memory hosts with up to 96 cores. We show that KNEM suffers less from process placement on these complex architectures. It improves throughput up to a factor of 2 for large messages for both point-to-point and collective operations, and significantly improves NPB execution time. We detail when to switch from one strategy to the other depending on the communication pattern and we show that I/OAT copy offload only appears to be an interesting solution for older architectures.



“…Previous work [3], [4] introduced operating system assistance as a way to improve large message throughput. We present an in-depth study of this solution in the context of complex shared-memory machines.…”

Section: Introduction (mentioning)

confidence: 99%

Optimizing MPI communication within large multicore nodes with kernel assistance

Moreaud, Goglin, Namyst, et al. 2010

2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW)

Self Cite


“…This approach can improve performance for large-message transfers among processes that do not share a cache. A variety of standard and nonstandard methods for doing so are available on Unix [2]. Windows provides an OS service for directly accessing the address space of a specified process, provided the process has appropriate security privileges.…”

Section: Intranode Communication (mentioning)

confidence: 99%

Implementing MPI on Windows: Comparison with Common Approaches on Unix

Krishna, Balaji, Lusk, et al. 2010

Recent Advances in the Message Passing Interface


Abstract. Commercial HPC applications are often run on clusters that use the Microsoft Windows operating system and need an MPI implementation that runs efficiently in the Windows environment. The MPI developer community, however, is more familiar with the issues involved in implementing MPI in a Unix environment. In this paper, we discuss some of the differences in implementing MPI on Windows and Unix, particularly with respect to issues such as asynchronous progress, process management, shared-memory access, and threads. We describe how we implement MPICH2 on Windows and exploit these Windows-specific features while still maintaining large parts of the code common with the Unix version. We also present performance results comparing the performance of MPICH2 on Unix and Windows on the same hardware. For zero-byte MPI messages, we measured excellent shared-memory latencies of 240 and 275 nanoseconds on Unix and Windows, respectively.


“…The LiMIC [7] kernel module can decrease the number of necessary memory copies to one by doing the memory movement with kernel access rights. KNEM [8] is a similar kernel module that also features DMA (Direct Memory Access) copy by using Intel I/O acceleration technique (I/OAT). DMA copy can decrease cache pollution and CPU noise from communication.…”

Section: Related Work (mentioning)

confidence: 99%

Locality and Topology Aware Intra-node Communication among Multicore CPUs

Bosilca, Bouteiller, et al. 2010

Recent Advances in the Message Passing Interface


Abstract. A major trend in HPC is the escalation toward manycore, where systems are composed of shared memory nodes featuring numerous processing units. Unfortunately, with scale comes complexity, here in the form of non-uniform memory accesses and cache hierarchies. For most HPC applications, harnessing the power of multicores is hindered by the topology oblivious tuning of the MPI library. In this paper, we propose a framework to tune every type of shared memory communications according to locality and topology. An implementation inside Open MPI is evaluated experimentally and demonstrates significant speedups compared to vanilla Open MPI and MPICH2.
