2014
DOI: 10.1371/journal.pone.0104591

Two-Level Incremental Checkpoint Recovery Scheme for Reducing System Total Overheads

Abstract: Long-running applications are often subject to failures, and a failure can impose unacceptable overhead on the system. Checkpointing is used to reduce the work lost when a failure occurs. In the two-level checkpoint recovery scheme used for long-running tasks, the system must periodically transfer a large memory context to remote stable storage. The overhead of setting checkpoints and the re-computation time therefore become critical issues that directly impact the …

Citations: cited by 10 publications (1 citation statement)
References: 35 publications
“…Among the variety of recovery techniques explored by modeling, simulation and experiments at scale, the de-facto one in HPC is checkpointing and rollback, in which the state of the parallel application is saved at successive time instant over the computing time. Being the single-level coordinated checkpoint scheme the most implemented one [7][8][9], other sophisticated versions of fault-tolerant protocols have been proposed during the last years, as it is the case of the multi-level (two-level and beyond) checkpointing [10][11][12][13][14] or the hierarchical approach among others [15][16][17][18][19][20].…”
Section: Introduction
Citation type: mentioning
confidence: 99%