Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2017
DOI: 10.1145/3126908.3126963
Why is MPI so slow?

Cited by 23 publications (4 citation statements). References 14 publications.
“…The data is normalized by reporting the number of DoF per node, so ideal weak scaling would correspond to coinciding lines. While the saturated performance is scaling well, giving a sustained performance of up to 4.4 PFlop/s, most of the in-cache performance advantage is lost due to the communication latency over MPI, see also [62] for limits with MPI in PDE solvers. Defining the strong scaling limit as the point where throughput reduces to 80% of saturated performance [29], it is reached for wall times of 56 μs on 1 node.…”
Section: Performance-optimized Conjugate Gradient Methods
confidence: 99%
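The 80% criterion quoted above can be made concrete with a short sketch: scan per-node throughput measurements from small to large local problem sizes and report the first point that reaches 80% of the saturated value, together with the implied wall time. The arrays below are hypothetical placeholders, not measurements from the cited work, and the wall-time estimate simply divides DoF by throughput.

/* Sketch: locating the strong scaling limit, defined as the point
 * where throughput drops to 80% of saturated performance.
 * All numbers here are hypothetical, not from the cited paper. */
#include <stdio.h>

int main(void)
{
    /* hypothetical per-node data: local problem size (DoF) and
       achieved throughput (DoF per second) */
    const double dofs[]       = { 1e4,   3e4,   1e5,   3e5,   1e6,   3e6 };
    const double throughput[] = { 0.9e9, 2.2e9, 3.4e9, 4.1e9, 4.3e9, 4.4e9 };
    const int n = (int)(sizeof dofs / sizeof dofs[0]);

    /* saturated performance: throughput at the largest local size */
    const double saturated = throughput[n - 1];
    const double limit     = 0.8 * saturated;

    /* walk from small to large local sizes; the first point at or
       above 80% of saturation marks the strong scaling limit */
    for (int i = 0; i < n; ++i) {
        if (throughput[i] >= limit) {
            printf("strong scaling limit near %.1e DoF/node, "
                   "wall time approx. %.1f us\n",
                   dofs[i], 1e6 * dofs[i] / throughput[i]);
            return 0;
        }
    }
    return 0;
}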
“…Nekbone has been updated to include vector solutions, which allows amortization of message and memory latencies. Nekbone has been used for assessment of advanced architectures and for evaluation of light-weight MPI implementations on the ALCF BG/Q, Cetus, in collaboration with Argonne's MPICH team (Raffenetti et al. 2017).…”
Section: Nekbench and Nekbone
confidence: 99%
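The "vector solutions" idea can be illustrated with a minimal, hedged MPI sketch: the halo values of several right-hand-side vectors are packed contiguously and moved in a single MPI_Sendrecv, so the per-message latency is paid once rather than once per vector. The buffer layout, sizes, and ring exchange below are illustrative assumptions, not Nekbone's actual interface.

/* Minimal sketch: batching several solution vectors into one
 * halo-exchange message to amortize message latency.
 * HALO and NVEC are hypothetical sizes, not Nekbone parameters. */
#include <mpi.h>
#include <stdio.h>

#define HALO 64   /* halo entries per vector (hypothetical) */
#define NVEC 8    /* number of simultaneous solution vectors */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double sendbuf[HALO * NVEC], recvbuf[HALO * NVEC];
    for (int i = 0; i < HALO * NVEC; ++i)
        sendbuf[i] = (double)rank;          /* dummy halo data */

    int right = (rank + 1) % size;          /* simple ring partners */
    int left  = (rank - 1 + size) % size;

    /* one message carries the halos of all NVEC vectors at once */
    MPI_Sendrecv(sendbuf, HALO * NVEC, MPI_DOUBLE, right, 0,
                 recvbuf, HALO * NVEC, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("exchanged %d doubles in a single message\n", HALO * NVEC);

    MPI_Finalize();
    return 0;
}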
“…Parallel applications are the dominant workload in high-performance computing (HPC) systems. Many of these parallel programs run across multiple compute nodes and processors and use the Message Passing Interface (MPI) for distributed communications and work distribution [1]-[3]. Effective management of MPI applications is thus vital for improving system utilization and application performance for HPC systems.…”
Section: Introduction
confidence: 99%