2019
DOI: 10.1016/j.parco.2019.02.007

Failure recovery for bulk synchronous applications with MPI stages

Cited by 10 publications
(5 citation statements)
references
References 9 publications
“…For instance, with MPI Stages (a Reinit-type model) [1], [2], the approach is to quiesce MPI and then call the checkpoint library for application state preservation plus optional MPI-Stages-related serialization strings for future recovery; then, the approach is to call a separate MPI-Stages internal state checkpoint (including application objects serialized). This concept of operations apparently does not contemplate that the CPR library actually uses MPI itself, thereby changing the internal state of MPI.…”
Section: Identifying the Issues (mentioning)
confidence: 99%
“…The application has to interpret that code in terms of the fault tolerant model it is itself using. 2) directly initiate recovery in the same model that the application uses. An alternative to the error-return code is to recompile the CPR library with C++, while still using the MPI C interface, but augmenting it with additional C++ features that provide fault-tolerant extensions (e.g., exceptions) [6].…”
Section: Identifying the Issues (mentioning)
confidence: 99%
“…Lastly, Sultana et al. [32] propose MPI stages to reduce the overhead of global-restart recovery by checkpointing MPI state, so that rolling back does not have to re-create it. While this approach is interesting, it is still in proof-of-concept status.…”
Section: Related Work (mentioning)
confidence: 99%
“…Reinit provides a simple interface to programmers to define a global restart point, in the form of a resilient target function. The early versions [13], [19], [36], [37] of Reinit have limited usage because they require hard-to-deploy changes to job schedulers. Most recently, Georgakoudis et al [14] propose a new design and implementation of Reinit into the Open MPI runtime.…”
Section: E Use Of Match (mentioning)
confidence: 99%