Fine-Grained Multithreading Support for Hybrid Threaded MPI Programming

Balaji, Pavan; Buntinas, Darius; Goodell, David; Gropp, William; Thakur, Rajeev

doi:10.1177/1094342009360206

Cited by 53 publications

(32 citation statements)

References 12 publications


“…Work early in the project was performed in collaboration with both Argonne and the IBM Blue Gene team [10]. Work on fine grain multithreading support showed how to avoid excessive lock overhead in an MPI implementation [3,2]. Recent work included a new algorithm for efficient allocation of context ids in MPI that fixes a subtle race condition in the algorithm that had been used in MPICH; this new algorithm retains the efficient behavior for the expected case [9].…”

Section: Some Of the Most Interesting Results From This Project Addre (mentioning)

confidence: 99%

Final Report for Enhancing the MPI Programming Model for PetaScale Systems

Gropp

2013


“…In addition to this common object allocation layer used for all MPI objects, MPICH2 provides another small optimization above this layer for MPI_Request objects [3]. In general, an MPI implementation must allocate a request object for each communication operation such as MPI_Send or MPI_Recv.…”

Section: MPICH2 Internals Background (mentioning)

confidence: 99%

“…For example, keeping track of |T| by atomically incrementing or decrementing a shared counter on every modification of T would incur a severe performance penalty. As mentioned in Section III, MPICH2 currently uses a thread-local storage optimization [3] to manage request allocation. This optimization eliminates virtually all contention from request allocation.…”

Section: B. Reference Counting With Garbage Collection Hybridization (mentioning)

confidence: 99%
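
The thread-local request allocation described in the statement above can be illustrated with a minimal sketch. This is not MPICH2's code: the names `Request`, `RequestPool`, and `global_lock_acquisitions` are hypothetical, and Python stands in for the C implementation. The idea shown is only that each thread keeps its own free list, so the shared lock is taken when a thread's local list is empty rather than on every allocation.

```python
import threading


class Request:
    """Stand-in for a request object (hypothetical, not MPICH2's struct)."""
    __slots__ = ("in_use",)

    def __init__(self):
        self.in_use = False


class RequestPool:
    """Request allocator with a per-thread free list in front of a shared pool."""

    def __init__(self):
        self._global_lock = threading.Lock()
        self._global_free = []               # shared pool, guarded by the lock
        self._local = threading.local()      # holds each thread's private free list
        self.global_lock_acquisitions = 0    # instrumentation for this sketch only

    def _cache(self):
        # Lazily create the calling thread's private free list.
        cache = getattr(self._local, "free", None)
        if cache is None:
            cache = self._local.free = []
        return cache

    def alloc(self):
        cache = self._cache()
        if cache:
            # Fast path: reuse a request from this thread's list, no locking.
            req = cache.pop()
        else:
            # Slow path: fall back to the shared pool under the global lock.
            with self._global_lock:
                self.global_lock_acquisitions += 1
                req = self._global_free.pop() if self._global_free else Request()
        req.in_use = True
        return req

    def free(self, req):
        # Return the request to the calling thread's list, no locking.
        req.in_use = False
        self._cache().append(req)
```

In this sketch a thread that repeatedly allocates and frees one request touches the shared lock only once, on its first allocation. A real implementation would also need to bound the per-thread lists and handle requests freed by a thread other than the one that allocated them, which this sketch does not attempt.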

“…To this end we present a minor variation on our threaded message rate benchmark [3] that sends and receives messages bidirectionally in order to provide the most parallel and efficient baseline possible. This benchmark measures the aggregate message rate for N threads in a single MPI process, each sending to and receiving from a corresponding peer process on a separate node.…”

Section: B. The Neighbor Message Rate Benchmark (mentioning)

confidence: 99%


Minimizing MPI Resource Contention in Multithreaded Multicore Environments

Goodell

Balaji

Buntinas

et al. 2010

2010 IEEE International Conference on Cluster Computing

Self Cite


Abstract: With the ever-increasing numbers of cores per node in high-performance computing systems, a growing number of applications are using threads to exploit shared memory within a node and MPI across nodes. This hybrid programming model needs efficient support for multithreaded MPI communication. In this paper, we describe the optimization of one aspect of a multithreaded MPI implementation: concurrent accesses from multiple threads to various MPI objects, such as communicators, datatypes, and requests. The semantics of the creation, usage, and destruction of these objects implies, but does not strictly require, the use of reference counting to prevent memory leaks and premature object destruction. We demonstrate how a naïve multithreaded implementation of MPI object management via reference counting incurs a significant performance penalty. We then detail two solutions that we have implemented in MPICH2 to mitigate this problem almost entirely, including one based on a novel garbage collection scheme. In our performance experiments, this new scheme improved the multithreaded messaging rate by up to 31% over the naïve reference counting method.


“…When a lock is released, a thread waiting for the global lock gets access to the lock and performs progress on its MPI communication. We are also developing a more efficient version of MPICH2 that supports finer-grained locks [1].…”

Section: Threads (mentioning)

confidence: 99%
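
The contrast between a single global lock and finer-grained locks mentioned in the statement above can be sketched as follows. This is an illustration under stated assumptions, not the MPICH2 design: the classes `GlobalLockQueues` and `PerQueueLocks` and the helper `run` are hypothetical, and message queues are modeled as plain Python lists.

```python
import threading


class GlobalLockQueues:
    """Coarse-grained: every operation on any queue takes the same lock."""

    def __init__(self, n):
        self._lock = threading.Lock()
        self._queues = [[] for _ in range(n)]

    def post(self, q, msg):
        with self._lock:
            self._queues[q].append(msg)

    def drain(self, q):
        with self._lock:
            msgs, self._queues[q] = self._queues[q], []
        return msgs


class PerQueueLocks:
    """Finer-grained: each queue has its own lock, so threads working on
    different queues do not wait on each other."""

    def __init__(self, n):
        self._locks = [threading.Lock() for _ in range(n)]
        self._queues = [[] for _ in range(n)]

    def post(self, q, msg):
        with self._locks[q]:
            self._queues[q].append(msg)

    def drain(self, q):
        with self._locks[q]:
            msgs, self._queues[q] = self._queues[q], []
        return msgs


def run(lib, n_threads, n_msgs):
    """Have thread q post n_msgs messages to queue q; return per-queue counts."""
    def worker(q):
        for i in range(n_msgs):
            lib.post(q, i)

    threads = [threading.Thread(target=worker, args=(q,)) for q in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [len(lib.drain(q)) for q in range(n_threads)]
```

Both variants deliver the same messages. The difference is which threads can block each other: with the global lock every `post` serializes, while with per-queue locks only threads that share a queue contend.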

Implementing MPI on Windows: Comparison with Common Approaches on Unix

Krishna

Balaji

Lusk

et al. 2010

Recent Advances in the Message Passing Interface

Self Cite


Abstract. Commercial HPC applications are often run on clusters that use the Microsoft Windows operating system and need an MPI implementation that runs efficiently in the Windows environment. The MPI developer community, however, is more familiar with the issues involved in implementing MPI in a Unix environment. In this paper, we discuss some of the differences in implementing MPI on Windows and Unix, particularly with respect to issues such as asynchronous progress, process management, shared-memory access, and threads. We describe how we implement MPICH2 on Windows and exploit these Windows-specific features while still maintaining large parts of the code common with the Unix version. We also present performance results comparing the performance of MPICH2 on Unix and Windows on the same hardware. For zero-byte MPI messages, we measured excellent shared-memory latencies of 240 and 275 nanoseconds on Unix and Windows, respectively.
