Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (2011)
DOI: 10.1145/2063384.2063428

Checkpointing strategies for parallel jobs

Abstract: This work provides an analysis of checkpointing strategies for minimizing expected job execution times in an environment that is subject to processor failures. In the case of both sequential and parallel jobs, we give the optimal solution for exponentially distributed failure inter-arrival times, which, to the best of our knowledge, is the first rigorous proof that periodic checkpointing is optimal. For non-exponentially distributed failures, we develop a dynamic programming algorithm to maximize the amount of…

Cited by 74 publications (88 citation statements)
References 24 publications
“…Therefore, in the optimal case, the number of checkpoints equals the number of failures, which equals the number of recoveries. There are various works that define optimal checkpoint intervals [28], [29]. Finally, we assume that checkpoint commit is synchronous; that is, the primary application process is paused during the commit operation and is not resumed until checkpoint commit is complete.…”
Section: A Checkpoint Compression Viability Model
Type: mentioning
confidence: 99%
“…The first step is to generate a fault distribution: we use an existing fault simulator developed in [21,22]. In our case, we use this simulator with an exponential law of parameter λ.…”
Section: Simulation Settings
Type: mentioning
confidence: 99%
“…Many models are available to understand the behavior of checkpoint/restart [19,20,21,22], and thereby to define an optimal checkpoint period. [23] proposes a scalability model to evaluate the impact of failures on application performance.…”
Section: Related Work
Type: mentioning
confidence: 99%
“…, $t_j$, and to checkpoint after $t_j$, without any intermediate checkpoint, and knowing that a checkpoint has been taken after task $t_{i-1}$. To the best of our knowledge, the expectation $E(W, C)$ of the time needed to successfully compute during $W$ seconds and then take a checkpoint of duration $C$ is known only for Exponentially distributed failures; from [22], we know that:…”
Section: Optimal Incremental Checkpointing Strategy
Type: mentioning
confidence: 99%