Numerical computation algorithms for sequential checkpoint placement

Ozaki, T.; Kaio, Naoto

doi:10.1016/j.peva.2008.11.003

Cited by 19 publications

(12 citation statements)

References 34 publications


“…Different from the previous [8] and [10][11][12], the proposed mathematical analytical model is independent of the explicit expression of F (t). In other words, the availability of the proposed checkpoint scheduling algorithm cannot be affected by the variety of the failure rate r(t) = F (t)…”

Section: Discussion (mentioning)

confidence: 99%

“…According to [9], it can be concluded that a constant CI is optimal on condition that the system fault follows Poisson/exponential process. For the particular PF 2 failure distribution, the non-increasing CI sequence can be performed in [10][11][12]. For large-scale HPC system, Liu presented the reliability-aware method for optimal checkpoint/restart strategy to minimize rollback recovery and checkpointing overheads [13,14].…”

Section: Introduction (mentioning)

confidence: 99%

“…For large-scale HPC system, Liu presented the reliability-aware method for optimal checkpoint/restart strategy to minimize rollback recovery and checkpointing overheads [13,14]. However, the models in [9][10][11][12][13][14] assume that no failure event occurs during the rollback recovery phase, which is not a considerate representation for the characteristic of the rollback recovery execution. Besides, [15][16][17][18][19] also intended to determine the optimal checkpoint sequence under a certain circumstance in terms of the failure distribution.…”

Section: Introduction (mentioning)

confidence: 99%


Checkpoint scheduling model for optimality

Xu, Men, Li et al. 2011

Information Processing Letters



“…Conversely, if CPs are seldom placed, a larger RB overhead after a system failure will be required. Hence, it is important to determine the optimal CP interval taking account of the trade-off between two kinds of overhead factors above [4], [5], [16]. Gelenbe et al.…”

Section: Introduction (mentioning)

confidence: 99%
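The trade-off described in the statement above can be illustrated with a small numerical sketch. It is not taken from the cited papers: it assumes the classical first-order approximation for periodic checkpointing under exponentially distributed failures (often attributed to Young), where checkpoints and recovery are themselves failure-free. The function names and the example values for the checkpoint cost and failure rate are hypothetical.

```python
import math


def overhead_rate(interval, checkpoint_cost, failure_rate):
    """Approximate expected overhead per unit time (first-order model).

    checkpoint_cost / interval      : share of time spent writing checkpoints
    failure_rate * interval / 2.0   : expected rework lost to rollback
    Assumes exponential failures and interval much shorter than 1/failure_rate.
    """
    return checkpoint_cost / interval + failure_rate * interval / 2.0


def young_interval(checkpoint_cost, failure_rate):
    """Interval that minimizes overhead_rate: sqrt(2 * C / lambda)."""
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


if __name__ == "__main__":
    cost = 0.5    # hypothetical checkpoint cost, in hours
    rate = 0.01   # hypothetical failure rate, in failures per hour
    best = young_interval(cost, rate)
    # Compare the minimizing interval against a shorter and a longer one.
    for interval in (best / 2.0, best, best * 2.0):
        print(f"interval={interval:6.2f} h  "
              f"overhead={overhead_rate(interval, cost, rate):.4f}")
```

Placing checkpoints too often inflates the first term and placing them too rarely inflates the second, so the overhead is smallest between the two extremes. Models for non-exponential failure distributions, such as those discussed in the cited works, generally lead to non-constant intervals and are not captured by this sketch.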

“…Bouguerra et al [2] also give an analytical model with coordinated CP/RB for a large scale cluster system. It is worth mentioning that the above works are based on the direct application of the similar analytical techniques to the CP placement for coherent computer systems [4], [5], [16]. However, the above works did not consider the possibility of occurrence of multi-node failure.…”

Section: Introduction (mentioning)

confidence: 99%

Comparing Checkpoint and Rollback Recovery Schemes in a Cluster System

Noriaki

2012

Algorithms and Architectures for Parallel Processing

Self Cite


Abstract. Cluster systems play a central role in realizing high-performance computing at relatively low cost, and at the same time they require fault-tolerance features for practical use. In this paper we develop stochastic models to evaluate the expected total recovery overhead for a cluster computing system under three well-known checkpoint and rollback recovery schemes: checkpoint mirroring, central file server checkpointing, and skewed checkpointing, where the fault latency time after a system failure is given by a random variable. In general, since multi-node failures as well as single-node failures may occur in a cluster system, it is not easy to obtain a closed form for the expected total recovery overhead. Based on a simple failure model, we do this by enumerating all possible combinations of probabilistic events caused by multi-node failure. Further, we compare the expected total recovery overhead across the different checkpoint and rollback recovery schemes and quantitatively evaluate the effectiveness of these schemes.


Equidistant Checkpoint Placement for Checkpointing and Rollback Recovery

Yin

2013

Lecture Notes in Electrical Engineering

