2007 IEEE International Parallel and Distributed Processing Symposium 2007
DOI: 10.1109/ipdps.2007.370605

The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI

Abstract: To be able to fully exploit ever larger computing platforms, modern HPC applications and system software must be able to tolerate inevitable faults. Historically, MPI implementations that incorporated fault tolerance capabilities have been limited by lack of modularity, scalability and usability. This paper presents the design and implementation of an infrastructure to support checkpoint/restart fault tolerance in the Open MPI project. We identify the general capabilities required for distributed checkpoint/re…

Cited by 143 publications (104 citation statements)
References 19 publications
“…Methods for tolerating errors in hardware have been studied in the past [6,9]. This resulted in several solutions at different levels of the design stack, in both hardware and software.…”
Section: Introduction (mentioning)
confidence: 99%
“…Some MPI libraries make use of system-level checkpointing in application-level checkpointing. Open MPI [13] and LAM/MPI [23] have chosen to implement generic checkpoint/restart mechanisms that can support multiple existing kernel checkpointers. Since this approach fits well with ours, these two libraries can be supported by our service, too.…”
Section: Application-level Checkpointing (mentioning)
confidence: 99%
“…The Open MPI Checkpoint/Restart framework [13] is very similar to XtreemGCP but limited to the MPI context. The framework architecture is based on a central coordinator, is able to address various process checkpointers.…”
Section: Related Work (mentioning)
confidence: 99%
“…The implementation, while realized over LAM (Local Area Multicomputer)/MPI's C/R support [43] through Berkeley Labs C/R (BLCR) [15], is in its mechanisms applicable to any process-migration solution, e.g., the Open MPI FT mechanisms [24], [25]. BLCR is an open source, system-level C/R implementation integrated with LAM/MPI via a callback function.…”
Section: Introduction (mentioning)
confidence: 99%