Designing truly one-sided MPI-2 RMA intra-node communication on multi-core systems

Lai, Ping; Sur, Sayantan; Panda, Dhabaleswar K.

doi:10.1007/s00450-010-0115-3

Cited by 14 publications

(24 citation statements)

References 14 publications


“…Lai et al present an OSC implementation which makes use of kernel and hardware facilities to accelerate the interprocess message transfer [14]. In a follow-up work, the authors designed an OSC implementation for conventional shared memory systems which provide hardware cache coherence [20].…”

Section: Related Work (mentioning)

confidence: 99%


Software-managed Cache Coherence for fast One-Sided Communication

Christgau

Schnor

2016

Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores


The ongoing many-core design aims at core counts where cache coherence becomes a serious challenge. Therefore, this paper discusses how one-sided communication can be implemented on a non-cache coherent many-core CPU. The Intel SCC serves as an exemplary hardware architecture. The presented approach is based on software-managed cache coherence for MPI one-sided communication. The prototype implementation delivers a PUT performance of up to five times faster than the default message-based approach and reveals a reduction of the communication costs for the NPB 3D FFT by a factor of five. Further, the paper identifies drawbacks of the SCC's architecture and derives conclusions for future architectures.



“…Since PSCW is more appropriate for the regarded FFT benchmark, a PSCW synchronization scheme based on bit vectors like proposed in [14] was tuned for the SCC. A detailed discussion is out of the scope of this paper.…”

Section: Synchronization (mentioning)

confidence: 99%

Software-managed Cache Coherence for fast One-Sided Communication

Christgau

Schnor

2016

Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores


“…Additionally, MPI researchers conduct long-term optimization works on the MPI RMA [16][17][18][19] and collective operations [20][21][22]. Those optimizations are also liable to be applied onto our DART and then benefit the performance of applications.…”

Section: Related Work (mentioning)

confidence: 99%

Application Productivity and Performance Evaluation of Transparent Locality-aware One-sided Communication Primitives

Zhou

Gracia

2017

IJNC


Nowadays, the individual nodes of a distributed parallel computer consist of multi- or manycore processors allowing to execute more than one process per node. The large difference in communication speed within a node through shared memory, versus across nodes through the network interconnect, requires to use locality-aware communication schemes for any efficient distributed application. However, writing an efficient locality-aware MPI code is complex and error-prone, because the developer has to use very different APIs for communication operations within and across nodes, respectively, and manage inter-process synchronization. In this paper, we analyze and enhance a recent one-sided communication model, namely DART-MPI, which is implemented on top of MPI-3. In this runtime system, the complexities of handling locality-awareness of MPI memory access operations, either remote or local, and the related synchronization calls are hidden inside the related DART-MPI interfaces resulting in concise code and improved application and developer productivity. We have carried out in-depth evaluation of our DART-MPI system. Foremost, a micro benchmark is conducted to help understanding the prime performance overhead of implementing APIs in DART-MPI system, which is small and becomes negligible with the growing message sizes. We then compare the performance of DART-MPI and flat MPI without locality awareness, in particular blocking and non-blocking memory operations, using a realistic scientific application on a large-scale supercomputer. The comparison demonstrates that in most cases the DART-MPI version of this application shows better performance than the flat MPI version. Further, we compare the DART-MPI version to a functionally equivalent MPI version, which thus includes code to deal with data-locality, and show that DART-MPI realizes almost the full potential of highly optimized MPI while maintaining high productivity for non-expert programmers.


“…Since MPI was first implemented in 1992, it has been implemented and optimized on different computing environments, e.g., multicore processors [18], [20], [11], [5], wide area network [17], and Infiniband networks [13], [28]. Our idea of grouping is partially inspired by the grouping algorithms in those previous studies.…”

Section: MPI (mentioning)

confidence: 99%

Network Performance Aware MPI Collective Communication Operations in the Cloud

Gong

Zhong

2015

IEEE Trans. Parallel Distrib. Syst.


This paper examines the performance of collective communication operations in Message Passing Interfaces (MPI) in the cloud computing environment. The awareness of network topology has been a key factor in performance optimizations for existing MPI implementations. However, virtualization in the cloud environment not only hides the network topology information from the users, but also causes traffic interference and dynamics to network performance. Existing topology-aware optimizations are no longer feasible in the cloud environment. Therefore, we develop novel network performance aware algorithms for a series of collective communication operations including broadcast, reduce, gather and scatter. We further implement two common applications, N-body and conjugate gradient (CG). We have conducted our experiments with two complementary methods (on Amazon EC2 and simulations). Our experimental results show that the network performance awareness results in 25.4% and 28.3% performance improvement over MPICH2 on Amazon EC2 and on simulations, respectively. Evaluations on N-body and CG show 41.6% and 14.3% respectively on application performance improvement.

Index Terms: Cloud Computing, MPI, Collective Operations, Network Performance Optimizations

INTRODUCTION

Cloud computing has emerged as a popular computing paradigm for many distributed and parallel applications. Message Passing Interface (MPI) is a common and key software component in distributed and parallel applications, and its performance is the key factor for the network communication efficiency. This paper investigates whether and how we can improve the performance of MPI in the cloud computing environment.

Since collective communications are the most important MPI operations for the system performance [13], [14], [17], this paper focuses on the efficiency of MPI collective communication operations.

Network topology aware algorithms have been applied to optimize the performance of collective communication operations [13], [28], [26], [14], [17]. Most of the studies adopt tree-based algorithms, since the network topology is often tree-structured. The essential idea of those algorithms is to obtain the topology information with hardware

• Yifan Gong, Bingsheng He and Jianlong Zhong are with

