2013
DOI: 10.1016/j.jpdc.2012.09.016
KNEM: A generic and scalable kernel-assisted intra-node MPI communication framework

Abstract: The multiplication of cores in today's architectures raises the importance of intra-node communication in modern clusters and their impact on the overall parallel application performance. Although several proposals focused on this issue in the past, there is still a need for a portable and hardware-independent solution that addresses the requirements of both point-to-point and collective MPI operations inside shared-memory computing nodes. This paper presents the KNEM module for the Linux kernel that provides …

Cited by 61 publications (15 citation statements)
References 29 publications
“…In shared memory, the operating system allows the direct movement of data between two processes in just one transfer, for instance, using KNEM [41] or LiMIC [42] kernel modules in MPI, and high performance networks reduce the number of transfers through the use of RDMA (Remote Direct Memory Access) mechanisms [43].…”
Section: Modeling a Transmission
confidence: 99%
“…However, double copies are required for point-to-point communication. Goglin et al. [23] support efficient intra-node MPI communication for large messages by using kernel-assisted direct copies between processes. However, for small messages (such as those used in PDES), they observe that the standard two-copy implementation performs better.…”
Section: MPI on Shared Memory Architectures
confidence: 99%
“…Multiple data transfer strategies have been proposed [1], including relying on the external network interface, on specific network drivers, on custom operating system features [2], or on user-level techniques such as shared buffers and pipelining. This was still an active research area recently through platform-independent direct-copy mechanisms such as LiMIC [3] and KNEM [4], and the inclusion of Cross Memory Attach [5] in the Linux kernel.…”
Section: B. Too Many Configuration Options
confidence: 99%
“…These benchmarks are run using our mbench framework. It offers easy ways to set up memory buffers in specific cache states and compute the corresponding memory access throughputs for different numbers of threads. Measuring the memory throughput for different buffer sizes also captures the performance and size of each level of cache, which allows us to explicitly ignore them in the model.…”
Section: Modeling Communication by Combining Microbenchmarks
confidence: 99%