2018 26th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) 2018
DOI: 10.1109/pdp2018.2018.00032

Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery

Abstract: Efficient utilization of today's high-performance computing (HPC) systems with complex hardware and software components requires that the HPC applications are designed to tolerate process failures at runtime. With low mean-time-to-failure (MTTF) of current and future HPC systems, long running simulations on these systems require capabilities for gracefully handling process failures by the applications themselves. In this paper, we explore the use of fault tolerance extensions to Message Passing Interface (MPI) …

Cited by 17 publications (13 citation statements)
References 19 publications
“…Although not yet part of the standard, the proposed ULFM extensions have already been applied in a number of studies (see, e.g. Ali et al, 2014; Ashraf et al, 2018; Bland et al, 2013; Cantwell and Nielsen, 2019; Engwer et al, 2018; Fagg and Dongarra, 2000; Gamell et al, 2017a, 2017b; Losada et al, 2020; Teranishi and Heroux, 2014).…”
Section: Resilience Methodologies (mentioning)
confidence: 99%
“…Then, the MPIX_Comm_shrink routine is used to create a new communicator excluding the failed processes. An example of this kind of application is found in [3], where an iterative application is rendered moldable by redistributing the rest of the dataset among the surviving processes.…”
Section: Typical Patterns (mentioning)
confidence: 99%
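
The shrink-and-redistribute pattern described in this statement can be illustrated with a minimal sketch. It is a plain-Python simulation of the bookkeeping only, not ULFM code and not the method of [3]: the function names `block_partition` and `shrink_and_redistribute` are hypothetical, the even block partition is an assumed redistribution policy, and no MPI calls are made.

```python
def block_partition(n_items, n_procs):
    """Split n_items into n_procs contiguous (start, stop) ranges, as evenly as possible."""
    base, extra = divmod(n_items, n_procs)
    ranges = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` ranks each take one additional item.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def shrink_and_redistribute(n_items, ranks, failed):
    """Simulate shrinking: drop failed ranks, then re-split the data among survivors.

    Returns a dict mapping each surviving (old) rank to its new (start, stop) range.
    This mimics what an application might do after obtaining a shrunken
    communicator; the even block split is an assumption for illustration.
    """
    survivors = [r for r in ranks if r not in failed]
    if not survivors:
        raise RuntimeError("no surviving processes")
    new_ranges = block_partition(n_items, len(survivors))
    return dict(zip(survivors, new_ranges))


if __name__ == "__main__":
    # Four processes share ten items; rank 2 fails.
    print(shrink_and_redistribute(10, range(4), {2}))
```

In a real application the surviving processes would also have to recover the data owned by the failed rank, for example from a checkpoint, before the new ranges can be populated.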
“…Ashraf et al [3] compare a shrinking and a non-shrinking solution for a fault-tolerant version of the generalized minimal residual (GMRES) algorithm. The algorithm already offers protection against silent data corruptions, and protection against hard errors is implemented combining ULFM and a backward global recovery using diskless checkpointing.…”
Section: Shrinking Solutions (mentioning)
confidence: 99%
“…BB underutilization is also caused by the characteristics of checkpoint/restart. HPC applications perform checkpoints with a fixed period [16]-[18], called the checkpoint period, by repeating a compute phase and an I/O phase periodically. Unfortunately, as the checkpoint period ranges from tens of minutes to tens of hours, expensive BB resources stay idle for long compute phases.…”
Section: A Burst Buffer Underutilization (mentioning)
confidence: 99%