2003
DOI: 10.1145/966049.781513
Automated application-level checkpointing of MPI programs

Abstract: Because of increasing hardware and software complexity, the running time of many computational science applications now exceeds the mean time to failure of high-performance computing platforms. Therefore, computational science applications need to tolerate hardware failures. In this paper, we focus on the stopping failure model, in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated n…
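The approach named in the abstract, application-level checkpointing, means the program itself saves exactly the state it needs at safe points and restores it on restart, rather than relying on the system to dump the whole process image. A minimal single-process sketch (file name, state layout, and function are illustrative assumptions, not from the paper):

```python
import json
import os

CKPT = "checkpoint.json"  # hypothetical checkpoint file name


def run(total_steps=10):
    # On restart, resume from the last application-level checkpoint if one exists.
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        state["acc"] += state["step"]  # stand-in for the real computation
        state["step"] += 1
        # Application-level checkpoint: serialize only the named program
        # state (not the whole process image) at a known-safe point.
        with open(CKPT, "w") as f:
            json.dump(state, f)
    return state["acc"]
```

If the process dies mid-run, calling `run` again picks up from the last saved step instead of recomputing from scratch; in an MPI program the same pattern must additionally be coordinated across ranks, which is the problem the paper addresses.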

Cited by 60 publications (57 citation statements)
References 16 publications
“…Our approach is motivated by the observation that on today's ever-larger systems, checkpointing has become a standard fault tolerance practice, in answer to the shortening MTBF (mean time between failures). With portable, optimized tools such as BLCR [17], parallel file systems designed with checkpointing as a major workload or even specifically for the checkpointing purpose (such as PLFS [18]), emerging high-performance hardware such as aggregated SSD storage, and a large amount of recent/ongoing research on checkpointing [19][20][21], the efficiency and scalability of job checkpointing has been improving significantly. Such growing checkpoint capability on large-scale systems enables us to relax the backfill conditions, by allowing jobs to be aggressively backfilled, and suspended for later execution if resources are due for advance reservation.…”
Section: CDF (%)
confidence: 99%
“…Such schemes are used by CoCheck [23], Starfish [1], Clip [10] and AMPI [15,26] to provide fault tolerant versions of MPI. A coordinated checkpointing algorithm that uses application level checkpointing is presented in [8]. Communication induced checkpoint protocols try to combine the advantages of coordinated and uncoordinated by allowing processors to take a mix of independent and coordinated checkpoints [7].…”
Section: Related Work
confidence: 99%
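The coordinated checkpointing that the statement above contrasts with uncoordinated schemes can be sketched as a toy, single-machine analogue: Python threads stand in for MPI processes, and a barrier plays the role of the coordination protocol, so that no message is in flight when each "process" saves its state and the saved states form a consistent global snapshot. The function name and state layout are illustrative assumptions, not from any cited paper:

```python
import threading


def coordinated_checkpoint(num_procs=4, steps=3):
    # All "processes" synchronize at a barrier before writing their
    # checkpoints, so the per-rank saved states line up as one
    # consistent global snapshot per step.
    barrier = threading.Barrier(num_procs)
    checkpoints = [[] for _ in range(num_procs)]

    def proc(rank):
        state = 0
        for step in range(steps):
            state += rank + step             # local computation
            barrier.wait()                   # coordinate: reach the checkpoint line
            checkpoints[rank].append(state)  # then checkpoint locally
            barrier.wait()                   # resume only after all have saved

    threads = [threading.Thread(target=proc, args=(r,)) for r in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return checkpoints
```

Uncoordinated schemes drop the barriers and let each process checkpoint independently, which avoids the synchronization cost but risks the domino effect on recovery; communication-induced protocols, as the quote notes, mix the two.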
“…System-level checkpoints at remote storage cause large amounts of data to be sent through the network, but application-level checkpoints require modifications of the application code, and as such are not completely transparent to the programmer, in the sense that a code written for a non-fault-tolerant implementation of MPI requires some modifications to be executed on a fault-tolerant implementation of MPI using application-level checkpoints [Schulz et al 2004] [Bronevetsky et al 2003]. …”
Section: Related Work
confidence: 99%