Process Mapping for MPI Collective Communications

Zhang, Jin; Zhai, Jidong; Chen, Wenguang; Zheng, Weimin

doi:10.1007/978-3-642-03869-3_11

Cited by 32 publications

(13 citation statements)

References 16 publications

“…Thanks to an adequate placement policy enforced by both these mapping and binding parameters, it is possible to take into account the physical topology and, for instance, reduce the communication costs [14,15]. This is also used to improve collective communication performance [16]. Unfortunately, these options are totally non-standard and can even change from one version of a process manager to the other.…”

Section: Process Managers and Process Mapping (mentioning)

confidence: 99%

Hardware topology management in MPI applications through hierarchical communicators

Goglin, Jeannot, Mansouri et al. 2018

Parallel Computing

The MPI standard is a major contribution in the landscape of parallel programming. Since its inception in the mid 90s it has ensured portability and performance for parallel applications on a wide spectrum of machines and architectures. With the advent of multicore machines, understanding and taking into account the underlying physical topology and memory hierarchy have become of paramount importance. On the other hand, providing abstract mechanisms to manipulate the hardware topology is also fundamental. The MPI standard in its current state, however, and despite recent evolutions is still unable to offer mechanisms to achieve this. In this paper, we detail several additions to the standard for building new MPI communicators corresponding to hardware hierarchy levels. It provides the user with tools to address hardware topology and locality issues while improving application performance.

“…Various MPI process mapping methods have been proposed in the related studies. Most of the methods rely on offline profiling to trace communication between processes and to analyze the communication behaviors of the applications [2,9,21,31]. The main drawback of these methods is the requirement of offline profiling, which has a high overhead and is potentially time-consuming.…”

Section: Introduction (mentioning)

confidence: 99%

Online MPI Process Mapping for Coordinating Locality and Memory Congestion on NUMA Systems

Agung, Amrizal, Egawa et al. 2020

JSFI

Mapping MPI processes to processor cores, called process mapping, is crucial to achieving scalable performance on multi-core processors. By analyzing the communication behavior among MPI processes, process mapping can improve the communication locality, and thus reduce the overall communication cost. However, on modern non-uniform memory access (NUMA) systems, the memory congestion problem could degrade performance more severely than the locality problem because heavy congestion on shared caches and memory controllers could cause long latencies. Most existing work focuses only on improving the locality or relies on offline profiling to analyze the communication behavior. We propose a process mapping method that dynamically performs the process mapping for adapting to communication behaviors while coordinating the locality and memory congestion. Our method works online during the execution of an MPI application. It does not require modifications to the application, previous knowledge of the communication behavior, or changes to the hardware and operating system. Experimental results show that our method can achieve performance and energy efficiency close to the best static mapping method with low overhead to the application execution. In experiments with the NAS parallel benchmarks on a NUMA system, the performance and total energy improvements are up to 34% (18.5% on average) and 28.9% (13.6% on average), respectively. In experiments with two GROMACS applications on a larger NUMA system, the average improvements in performance and total energy consumption are 21.6% and 12.6%, respectively.

“…One consists in implementing codes that take into account the system characteristics [5]-[8], for instance minimizing the number of messages across the network or using blocks of data that fit in the caches to avoid cache misses. The other approach maps the processes to specific cores to improve the performance without changing the codes [9]-[11]. Knowledge of the topology of the machine and of some hardware parameters is necessary in both approaches.…”

Section: Introduction (mentioning)

confidence: 99%

Servet: A benchmark suite for autotuning on multicore clusters

González-Domínguez, Taboada, Fraguela et al. 2010

2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)

The growing complexity in computer system hierarchies due to the increase in the number of cores per processor, levels of cache (some of them shared) and the number of processors per node, as well as the high-speed interconnects, demands the use of new optimization techniques and libraries that take advantage of their features. In this paper, Servet, a suite of benchmarks focused on detecting a set of parameters with high influence on the overall performance of multicore systems, is presented. These benchmarks are able to detect the cache hierarchy, including cache sizes and which caches are shared by each core, bandwidths and bottlenecks in memory accesses, as well as communication latencies among cores. These parameters can be used by autotuned codes to increase their performance in multicore clusters. Experimental results using different representative systems show that Servet provides very accurate estimates of the parameters of the machine architecture.
