Failure recovery for bulk synchronous applications with MPI stages

Sultana, Nawrin; Rüfenacht, Martin; Skjellum, Anthony; Laguna, Ignacio; Mohror, Kathryn

doi:10.1016/j.parco.2019.02.007

Cited by 10 publications

(5 citation statements)

References 9 publications


“…For instance, with MPI Stages (a Reinit-type model) [1], [2], the approach is to quiesce MPI and then call the checkpoint library for application state preservation plus optional MPI-Stages-related serialization strings for future recovery; then, the approach is to call a separate MPI-Stages internal state checkpoint (including application objects serialized). This concept of operations apparently does not contemplate that the CPR library actually uses MPI itself, thereby changing the internal state of MPI.…”

Section: Identifying the Issues

mentioning

confidence: 99%

“…The application has to interpret that code in terms of the fault tolerant model it is itself using. 2) directly initiate recovery in the same model that the application uses². An alternative to the error-return code is to recompile the CPR library with C++, while still using the MPI C interface, but augmenting it with additional C++ features that provide fault-tolerant extensions (e.g., exceptions) [6].…”

Section: Identifying the Issues

mentioning

confidence: 99%

“…ULFM [3], Reinit [1], MPI Stages [2], [4], and FA-MPI [5] are deemed common models since discussed in the MPI Forum's Fault Tolerance Working Group.² It will probably be needed to code variations into the CPR library, but it seems expensive at this time to force the library to have a new, manual coding approach for every application FT mode chosen…”

mentioning

confidence: 99%


Checkpoint-Restart Libraries Must Become More Fault Tolerant

Skjellum

Schafer

2021

Preprint

Self Cite


Production MPI codes need checkpoint-restart (CPR) support. Clearly, checkpoint-restart libraries must be fault tolerant lest they open up a window of vulnerability for failures with byzantine outcomes. But certain popular libraries that leverage MPI are evidently not fault tolerant. Nowadays, fault detection with automatic recovery without batch requeueing is a strong requirement for production environments. Thus, allowing deadlock and setting long timeouts are suboptimal for fault detection even when paired with conservative recovery from the penultimate checkpoint. When MPI is used as a communication mechanism within a CPR library, such libraries must offer fault-tolerant extensions with minimal detection, isolation, mitigation, and potential recovery semantics to aid the CPR library's fail-backward. Communication between MPI and the checkpoint library regarding system health may be valuable. For fault-tolerant MPI programs (e.g., using APIs like FA-MPI, Stages/Reinit, or ULFM), the checkpoint library must cooperate with the extended model or else invalidate fault-tolerant operation.


“…Lastly, Sultana et al. [32] propose MPI stages to reduce the overhead of global-restart recovery by checkpointing MPI state, so that rolling back does not have to re-create it. While this approach is interesting, it is still in proof-of-concept status.…”

Section: Related Work

mentioning

confidence: 99%

Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance

Georgakoudis

Guo

Laguna

2020

Lecture Notes in Computer Science

Self Cite


Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, redeploying an application incurs overhead by tearing down and reinstating execution, and possibly limiting checkpointing retrieval from slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive new insight on performance. Experimentation with three different HPC proxy applications made resilient to withstand process and node failures shows that Reinit++ recovers much faster than restarting, up to 6×, or ULFM, up to 3×, and that it scales excellently as the number of MPI processes grows.


“…Reinit provides a simple interface to programmers to define a global restart point, in the form of a resilient target function. The early versions [13], [19], [36], [37] of Reinit have limited usage because they require hard-to-deploy changes to job schedulers. Most recently, Georgakoudis et al [14] propose a new design and implementation of Reinit into the Open MPI runtime.…”

Section: E. Use of MATCH

mentioning

confidence: 99%

MATCH: An MPI Fault Tolerance Benchmark Suite

Guo

Georgakoudis

Parasyris

et al. 2020

2020 IEEE International Symposium on Workload Characterization (IISWC)

Self Cite


MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effective MPI fault tolerance techniques have been proposed to enable MPI application execution to efficiently resume from system failures. However, there is no structured way to study and compare different MPI fault tolerance designs, so to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. To solve this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, research, and comprehensively compare different combinations and configurations of MPI fault tolerance designs. Our investigation derives useful findings: (1) Reinit recovery in general performs better than ULFM recovery; (2) Reinit recovery is independent of the scaling size and the input problem size, whereas ULFM recovery is not; (3) Using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design. MATCH code is available at https://github.com/kakulo/MPI-FT-Bench.
