Proceedings of the 3rd Workshop on Fault-Tolerance for HPC at Extreme Scale 2013
DOI: 10.1145/2465813.2465821

When is multi-version checkpointing needed?

Abstract: The scaling of semiconductor technology and increasing power concerns combined with system scale make fault management a growing concern in high performance computing systems. Greater variety of errors, higher error rates, longer detection intervals, and "silent" errors are all expected. Traditional checkpointing models and systems assume that error detection is nearly immediate and thus preserving a single checkpoint is sufficient for resilience. We define a richer model for future systems that captures the re…

Citations: cited by 44 publications (44 citation statements)
References: 33 publications
“…This section describes some related work on detecting and handling silent errors. A more comprehensive list of techniques and references is provided by Lu, Zheng and Chien [20].…”
Section: Related Work (mentioning)
confidence: 99%
“…One approach to dealing with silent errors is by maintaining several checkpoints in memory [20]. This multiple-checkpoint approach, however, has three major drawbacks.…”
Section: Introduction (mentioning)
confidence: 99%
“…In the case of fail-stop failures, a checkpoint cannot contain a corrupted state, because a process subject to failure will not create a checkpoint or participate to the application: failures are naturally contained to failed processes; in the case of silent errors, however, faults can propagate to other processes and checkpoints, because processes continue to participate and follow the protocol during the interval that separates the error and its detection. To alleviate this issue, one may envision to keep several checkpoints in memory, and to restore the application from the last valid checkpoint, thereby rolling back to the last correct state of the application [55]. This multiple-checkpoint approach has three major drawbacks.…”
Section: Motivation (mentioning)
confidence: 99%
“…Multiple versions can be useful during recovery from latent errors [6]. Traditional checkpoint/restart systems keep only the latest checkpoint, because they assume that checkpoint data is correct.…”
Section: Multi-versioning In Global View Resilience (mentioning)
confidence: 99%
“…It has two key features: multi-version, multi-stream distributed arrays and a unified error handling interface that supports flexible cross-layer error checking and recovery. Multi-versioning is a promising approach for handling latent errors [6], since a high probability exists that some versions have been created before the latent error corrupted the data. Introducing the concept of multi-version arrays, however, immediately raises a question of the cost of creating and keeping such multiple versions.…”
Section: Introduction (mentioning)
confidence: 99%
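
Several of the statements above describe keeping multiple checkpoints so that an application can roll back past a silent error that was detected late. The following is a minimal sketch of that idea, not the implementation from the paper or from any of the citing works. The class `MultiVersionCheckpoints`, the function `run_demo`, and the "all values equal" invariant used as an error detector are hypothetical names and assumptions chosen purely for illustration.

```python
import copy
from collections import deque
from typing import Any, Callable, Deque, Optional, Tuple


class MultiVersionCheckpoints:
    """Keep the most recent `max_versions` checkpoints (hypothetical helper)."""

    def __init__(self, max_versions: int) -> None:
        if max_versions < 1:
            raise ValueError("max_versions must be at least 1")
        # A bounded deque silently drops the oldest version when full.
        self._versions: Deque[Tuple[int, Any]] = deque(maxlen=max_versions)

    def save(self, step: int, state: Any) -> None:
        # Deep-copy so later mutation of the live state cannot alter the checkpoint.
        self._versions.append((step, copy.deepcopy(state)))

    def restore_last_valid(
        self, is_valid: Callable[[Any], bool]
    ) -> Optional[Tuple[int, Any]]:
        # Walk from newest to oldest and return the first version that passes the check.
        for step, state in reversed(self._versions):
            if is_valid(state):
                return step, copy.deepcopy(state)
        return None  # every retained version is corrupted


def run_demo(max_versions: int) -> Optional[int]:
    """Simulate a silent error at step 3 that is only detected after step 5."""
    store = MultiVersionCheckpoints(max_versions)
    state = {"values": [0, 0, 0]}
    for step in range(1, 6):
        state["values"] = [v + 1 for v in state["values"]]
        if step == 3:
            state["values"][0] = -1  # assumed silent corruption, undetected here
        store.save(step, state)

    # Assumed detector: in a correct run all entries stay equal.
    def is_valid(s: Any) -> bool:
        return len(set(s["values"])) == 1

    restored = store.restore_last_valid(is_valid)
    return None if restored is None else restored[0]


if __name__ == "__main__":
    for k in (1, 4):
        print(f"max_versions={k}: restored step = {run_demo(k)}")
```

In this toy run, a single retained checkpoint is already corrupted by the time the error is noticed, so recovery fails, whereas retaining four versions allows a rollback to step 2. The sketch ignores the storage and time costs of retaining extra versions, which the citing statements raise as a concern.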