1974
DOI: 10.1145/361147.361115

A first order approximation to the optimum checkpoint interval

Abstract: To avoid having to restart a job from the beginning in case of random failure, it is standard practice to save periodically sufficient information to enable the job to be restarted at the previous point at which information was saved. Such points are referred to as checkpoints, and the saving of such information at these points is called checkpointing [1].

Citations: cited by 549 publications (437 citation statements)
References: 0 publications
“…Let the time interval between checkpoints be $T_c$, the time to save checkpoint information be $T_s$, and the mean time between failures (MTBF) be $T_f$. Then, the optimal checkpoint rate is $T_c = \sqrt{2 \times T_s \times T_f}$ [53]. We also observed that the mean checkpoint time ($T_s$) for BT, CG, FT, LU and SP with class C inputs on 4, 8 or 9 and 16 nodes is 23 seconds on the same experimental cluster [51].…”
Section: G Proactive FT Complements Reactive FT (mentioning)
confidence: 59%
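
The interval in the statement above can be computed directly. The following is a minimal sketch, assuming Young's first-order result $T_c = \sqrt{2 T_s T_f}$ with all times in seconds. The function name `young_interval` and the one-day MTBF are illustrative assumptions; only the 23-second checkpoint time comes from the quoted statement.

```python
import math


def young_interval(save_time_s: float, mtbf_s: float) -> float:
    """First-order optimum checkpoint interval: sqrt(2 * T_s * T_f)."""
    if save_time_s <= 0 or mtbf_s <= 0:
        raise ValueError("save time and MTBF must be positive")
    return math.sqrt(2.0 * save_time_s * mtbf_s)


T_S = 23.0           # mean checkpoint time quoted above, in seconds
T_F = 24 * 3600.0    # hypothetical MTBF of one day (assumption, not from the source)

interval = young_interval(T_S, T_F)
print(f"checkpoint roughly every {interval:.0f} s ({interval / 60:.1f} min)")
```

The approximation is first order, so it is only expected to be reasonable when the checkpoint time is much smaller than the MTBF.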
“…We derive that $C_{i,j} = \frac{m_i}{j\tau} + \beta$. As for the checkpointing period $\tau_{i,j}$, we use Young's formula [17] and let…”
Section: Fault Model (mentioning)
confidence: 99%
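
As a brief sketch of where the square-root form of Young's formula comes from: write $P$ for the checkpointing period, $C$ for the cost of one checkpoint, and $\mu$ for the mean time between failures (the symbols $P$ and $\mu$ are assumptions made here for illustration). To first order, a fraction $C/P$ of the time goes to writing checkpoints, and each failure loses about half a period of work, which costs a fraction $P/(2\mu)$. Minimizing the total overhead gives:

$$
W(P) \approx \frac{C}{P} + \frac{P}{2\mu},
\qquad
\frac{dW}{dP} = -\frac{C}{P^{2}} + \frac{1}{2\mu} = 0
\quad\Longrightarrow\quad
P_{\text{opt}} = \sqrt{2\,C\,\mu}.
$$

This holds only in the regime $C \ll P \ll \mu$, which is why the result is described as a first-order approximation.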
“…Many models are available to understand the behavior of checkpoint/restart [19,20,21,22], and thereby to define an optimal checkpoint period. [23] proposes a scalability model to evaluate the impact of failures on application performance.…”
Section: Related Work (mentioning)
confidence: 99%