2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010
DOI: 10.1109/ipdpsw.2010.5470849
Optimizing MPI communication within large multicore nodes with kernel assistance

Abstract: As the number of cores per node increases in modern clusters, intra-node communication efficiency becomes critical to application performance. We present a study of the traditional double-copy model in MPICH2 and a kernel-assisted single-copy strategy with KNEM on different shared-memory hosts with up to 96 cores. We show that KNEM suffers less from process placement on these complex architectures. It improves throughput up to a factor of 2 for large messages for both point-to-point and collective operat…
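As a concrete point of reference (not code from the paper), the sketch below is a minimal MPI ping-pong between two ranks on the same node, the kind of intra-node point-to-point measurement the abstract describes. The 4 MiB message size and iteration count are arbitrary illustrative choices, and whether the transfer goes through a double copy or a KNEM-assisted single copy is decided by the MPI library build and message size, not by this code.

```c
/*
 * Minimal intra-node ping-pong sketch (illustrative): measures
 * point-to-point throughput between two ranks on the same node.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_BYTES (4 * 1024 * 1024)   /* illustrative "large message" size */
#define ITERS     100

int main(int argc, char **argv)
{
    int rank, size;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks on one node\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    buf = malloc(MSG_BYTES);
    memset(buf, 0, MSG_BYTES);        /* touch pages before timing */

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        double bytes = 2.0 * MSG_BYTES * ITERS;   /* each iteration moves the message twice */
        printf("throughput: %.2f MB/s\n", bytes / (t1 - t0) / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Built with mpicc and run with two ranks pinned to cores that do or do not share a cache, such a loop exposes the process-placement sensitivity the abstract mentions.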

Cited by 15 publications (13 citation statements); References 16 publications.
“…This approach has already proved to be valuable to increase point-to-point bandwidth between processes communicating over shared memory [5]. However, beyond the free performance upgrade offered by using more efficient point-to-point operations [14], hierarchy aware MPI collective components need more control over the underlying memory copy mechanism to reach their full potential. In this paper, we present new collective algorithms that take into account the specificities of kernel assisted memory copies, and require a new feature compared to state of the art kernel assisted copies: directional control of transfers.…”
Section: Related Work
confidence: 99%
“…While the beneficial effect of KNEM on point-to-point performance also translates into collective improvements [14], we have identified a series of additional optimizations that further boost collective communication performance. They require that the collective component has more control over the movement of data (vs. simply using MPI point-to-point primitives): (1) Because control of the kernel module is delegated to the point-to-point MPI message passing engine, using inter-process kernel-assisted memory copies results in the same data region being registered multiple times when sent to different destination processes.…”
Section: A. Issues with MPI Collective Operations
confidence: 99%
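To make the issue in the statement above concrete, here is a deliberately naive sketch (ours, not the cited implementation) of a broadcast layered on MPI point-to-point calls: the root pushes the same buffer once per peer, so a kernel-assisted point-to-point engine underneath may end up registering the same memory region once per destination.

```c
/*
 * Naive "broadcast over point-to-point" sketch (illustrative): the root
 * sends the same buffer to every peer with separate MPI_Send calls, so a
 * kernel-assisted point-to-point engine may register that region once per
 * destination -- the redundancy described in the quoted statement.
 */
#include <mpi.h>

static void naive_bcast(void *buf, int count, MPI_Datatype dt,
                        int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        for (int peer = 0; peer < size; peer++)
            if (peer != root)
                MPI_Send(buf, count, dt, peer, 0, comm);  /* same region, one transfer per peer */
    } else {
        MPI_Recv(buf, count, dt, root, 0, comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    int data[1024] = { 0 };
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        for (int i = 0; i < 1024; i++) data[i] = i;

    naive_bcast(data, 1024, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

A collective component that drives the copy mechanism itself could instead register the source region once and let every receiver pull from that single registration, which is the extra control over data movement, including the direction of the copy, that the cited work argues for.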
“…The KNEM collective [14] is implemented as an Open MPI intra-node collective component built directly on the kernel single-copy module KNEM. Details about KNEM copies can be found in [15], [16]. KNEM-based collectives mainly accelerate collective communication for large messages, not small ones, because trapping into the kernel and distributing cookies introduces an overhead (equivalent to a 16KB broadcast or a 2KB Allgather on the platforms described above) [14].…”
Section: A. Distance Between Processes
confidence: 99%
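The size dependence described above can be summarized as a simple crossover rule. The sketch below is our illustration, not the cited component's code; the 16 KB and 2 KB thresholds simply echo the break-even figures quoted for those specific platforms and are not universal.

```c
/*
 * Illustrative size-threshold selection: small messages stay on the
 * copy-in/copy-out shared-memory path, large ones switch to the
 * kernel-assisted single-copy path.  Crossover values are assumed from
 * the figures quoted above (~16 KB broadcast, ~2 KB allgather) and are
 * platform-specific.
 */
#include <stddef.h>
#include <stdio.h>

enum coll_path { COLL_SHARED_MEM_COPY, COLL_KERNEL_ASSISTED };

enum coll_path choose_bcast_path(size_t msg_bytes)
{
    const size_t bcast_crossover = 16 * 1024;      /* assumed crossover */
    return msg_bytes >= bcast_crossover ? COLL_KERNEL_ASSISTED
                                        : COLL_SHARED_MEM_COPY;
}

enum coll_path choose_allgather_path(size_t msg_bytes)
{
    const size_t allgather_crossover = 2 * 1024;   /* assumed crossover */
    return msg_bytes >= allgather_crossover ? COLL_KERNEL_ASSISTED
                                            : COLL_SHARED_MEM_COPY;
}

int main(void)
{
    printf("8 KB bcast -> %s\n",
           choose_bcast_path(8 * 1024) == COLL_KERNEL_ASSISTED
               ? "kernel-assisted" : "shared-memory copy");
    printf("1 MB bcast -> %s\n",
           choose_bcast_path(1024 * 1024) == COLL_KERNEL_ASSISTED
               ? "kernel-assisted" : "shared-memory copy");
    return 0;
}
```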
“…The KNEM assisted approach outperforms the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores, or when very large messages are being transferred. Even simply using KNEM assisted point-to-point communication underneath collective communication achieved a significant improvement [1,2]. Within Open MPI, a similar approach was implemented, with further emphasis on auto-tuning and performance portability [5].…”
Section: Related Work
confidence: 99%
“…MPICH2, since version 1.1.1, uses KNEM in the DMA LMT to improve large message performance within a single node. The work in [1,2] has shown that KNEM-enabled MPI communication significantly improves the performance of some micro- and macro-benchmarks. However, the performance of real scientific applications using KNEM-enabled MPI communication has yet to be assessed.…”
Section: Introduction
confidence: 99%