2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/sc.2010.18

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

Cited by 412 publications (383 citation statements)
References 23 publications
“…Latent errors, also known as silent errors or silent data corruption, represent a major threat to scientific applications executing on large scale platforms [21,22,23]. There are several causes of silent errors, such as cosmic radiation, packaging pollution, among others.…”
Section: Related Work (mentioning)
confidence: 99%
“…Moody et al introduced multi-level checkpointing to improve scalability [29]. Traditional checkpoint systems use the parallel file system (PFS) to store the checkpoint data.…”
Section: Related Work (mentioning)
confidence: 99%
“…On the other hand, application-level checkpoint assumes that the state of the tasks is enough to resume the execution of the program in case of a failure. The SCR library [3] uses this approach. One advantage of application-level checkpoint is to dramatically reduce the amount of memory to be checkpointed.…”
Section: A Checkpoint/restart (mentioning)
confidence: 99%
“…This paper compares three standard checkpoint-based fault tolerance methods according to their energy consumption. The first method is the traditional checkpoint/restart based on local storage that has been implemented in several libraries [3], [4]. The second strategy is a particular version of message-logging [5] that requires messages to be stored, but avoids a global rollback in case of a failure.…”
Section: Introduction (mentioning)
confidence: 99%