When is multi-version checkpointing needed?

Lu, Guoming; Zheng, Ziming; Chien, Andrew A.

doi:10.1145/2465813.2465821

Cited by 44 publications

(44 citation statements)

References 33 publications

“…This section describes some related work on detecting and handling silent errors. A more comprehensive list of techniques and references is provided by Lu, Zheng and Chien [20].…”

Section: Related Work (mentioning)

confidence: 99%

Assessing the Impact of Partial Verifications against Silent Data Corruptions

Cavelan

Raina

Robert

et al. 2015

2015 44th International Conference on Parallel Processing

Silent errors, or silent data corruptions, constitute a major threat on very large scale platforms. When a silent error strikes, it is not detected immediately but only after some delay, which prevents the use of pure periodic checkpointing approaches devised for fail-stop errors. Instead, checkpointing must be coupled with some verification mechanism to guarantee that corrupted data will never be written into the checkpoint file. Such a guaranteed verification mechanism typically incurs a high cost. In this paper, we assess the impact of using partial verification mechanisms in addition to a guaranteed verification. The main objective is to investigate to which extent it is worthwhile to use some light cost but less accurate verifications in the middle of a periodic computing pattern, which ends with a guaranteed verification right before each checkpoint. Introducing partial verifications dramatically complicates the analysis, but we are able to analytically determine the optimal computing pattern (up to the first-order approximation), including the optimal length of the pattern, the optimal number of partial verifications, as well as their optimal positions inside the pattern. Performance evaluations based on a wide range of parameters confirm the benefit of using partial verifications under certain scenarios, when compared to the baseline algorithm that uses only guaranteed verifications.

“…One approach to dealing with silent errors is by maintaining several checkpoints in memory [20]. This multiple-checkpoint approach, however, has three major drawbacks.…”

Section: Introduction (mentioning)

confidence: 99%

“…In the case of fail-stop failures, a checkpoint cannot contain a corrupted state, because a process subject to failure will not create a checkpoint or participate to the application: failures are naturally contained to failed processes; in the case of silent errors, however, faults can propagate to other processes and checkpoints, because processes continue to participate and follow the protocol during the interval that separates the error and its detection. To alleviate this issue, one may envision to keep several checkpoints in memory, and to restore the application from the last valid checkpoint, thereby rolling back to the last correct state of the application [55]. This multiple-checkpoint approach has three major drawbacks.…”

Section: Motivation (mentioning)

confidence: 99%

Fault-Tolerance Techniques for High-Performance Computing

2015

Computer Communications and Networks

This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de facto standard technique for resilience in High Performance Computing. We present the two main protocols, namely coordinated checkpointing and hierarchical checkpointing. Then we introduce performance models and use them to assess the performance of these protocols. We cover the Young/Daly formula for the optimal period and much more! Next we explain how the efficiency of checkpointing can be improved via fault prediction or replication. Then we move to application-specific methods, such as ABFT. We conclude the report by discussing techniques to cope with silent errors (or silent data corruption). This report is a slightly modified version of the first chapter of the monograph Fault tolerance techniques for high-performance computing edited by Thomas Herault and Yves Robert, and to be published by Springer Verlag.

“…Multiple versions can be useful during recovery from latent errors [6]. Traditional checkpoint/restart systems keep only the latest checkpoint, because they assume that checkpoint data is correct.…”

Section: Multi-versioning In Global View Resilience (mentioning)

confidence: 99%

“…It has two key features: multi-version, multi-stream distributed arrays and a unified error handling interface that supports flexible cross-layer error checking and recovery. Multi-versioning is a promising approach for handling latent errors [6], since a high probability exists that some versions have been created before the latent error corrupted the data. Introducing the concept of multi-version arrays, however, immediately raises a question of the cost of creating and keeping such multiple versions.…”

Section: Introduction (mentioning)

confidence: 99%

Empirical Comparison of Three Versioning Architectures

Fujita

Iskra

Balaji

et al. 2015

2015 IEEE International Conference on Cluster Computing

Self Cite

Future supercomputer systems will face serious reliability challenges. Among failure scenarios, latent errors are some of the most serious and concerning. Preserving multiple versions of critical data is a promising approach to deal with such errors. We are developing the Global View Resilience (GVR) library, with multi-version global arrays as one of the key features. This paper presents three array versioning architectures: flat array, flat array with change tracking, and log-structured array. We use a synthetic workload comparing the three array architectures in terms of runtime performance and memory requirements. The experiments show that the flat array with change tracking is the best architecture in terms of runtime performance, for versioning frequencies of $10^{-5}$ ops$^{-1}$ or higher, matching the second-best architecture or beating it by over 8 times, whereas the log-structured array is preferable for low memory usage, since it saves up to 88% of memory compared with a flat array.
