2004
DOI: 10.1145/1037187.1024421

Application-level checkpointing for shared memory programs

Abstract: Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-c…
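
The checkpoint-and-restart cycle described in the abstract can be illustrated with a minimal sketch. This is not the paper's compiler-based instrumentation: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the use of `pickle`, the toy summation workload, and the `fail_at` parameter that simulates a fault are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to disk atomically: dump to a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # a crash mid-write never corrupts the old checkpoint


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, interval, fail_at=None):
    """Sum 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical knob that raises at a given step to
    stand in for a hardware fault.
    """
    # Restart from the last saved state if there is one.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware fault")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` a second time with the same `path` after a failure resumes from the most recent checkpoint rather than from step 0; work done between that checkpoint and the fault is repeated.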

Cited by 19 publications (26 citation statements)
References 15 publications
“…Bronevetsky et al [6][7][8] have proposed a preprocessor-based approach for ALC. Their work is relevant for both shared memory and distributed memory architectures and their approach consists of two components: a preprocessor, and a checkpointing library.…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…During the domain analysis phase of developing the DSL, a survey of technical literature and existing implementations [6,7,14,15,20] was done to obtain an overview of the terminologies and concepts related to the ALC-domain. Commonly used terms and their relationships were used to develop the domain lexicon.…”
Section: Domain Analysis (mentioning)
confidence: 99%
“…Existing checkpoint approaches can be classified into four broad categories: (a) schemes that require applications to provide their own specialized checkpoint and recovery mechanisms (Bronevetsky et al 2003; Bronevetsky et al 2004); (b) schemes in which the compiler determines where checkpoints can be safely inserted (Beck et al 1994); (c) techniques that require operating system or hardware monitoring of thread state (Li et al 1990; Hulse 1995; Chen et al 1997); and (d) library implementations that capture and restore state (Dieter & Lumpp 1999). Checkpointing functionality provided by an application or a library relies on the programmer to define meaningful checkpoints.…”
Section: Related Work (mentioning)
confidence: 99%
“…Cornell Checkpointing Compiler (C³) is an application level checkpointing implementation for multi-thread applications in shared memory [12]. The checkpoint is done in three steps: each thread (1) calls a barrier, (2) saves its private state, and (3) calls a second barrier.…”
Section: Related Work (mentioning)
confidence: 99%
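
The three-step protocol in the statement above (barrier, save private state, second barrier) can be sketched roughly as follows. This is an illustration only, not the C³ implementation: Python threads and `threading.Barrier` stand in for the shared-memory runtime, the in-memory `store` dictionary stands in for disk, saving of shared state is omitted, and all names (`checkpoint`, `worker`, `run_threads`) are hypothetical.

```python
import threading


def checkpoint(barrier, tid, private_state, store):
    """Two-barrier checkpoint, following the three steps quoted above."""
    barrier.wait()                    # (1) wait until every thread has arrived
    store[tid] = dict(private_state)  # (2) save a copy of this thread's private state
    barrier.wait()                    # (3) nobody resumes until all threads have saved


def worker(barrier, tid, store, steps, ckpt_step):
    """Toy computation that takes one checkpoint at iteration `ckpt_step`."""
    private_state = {"tid": tid, "i": 0, "acc": 0}
    for i in range(steps):
        private_state["i"] = i
        private_state["acc"] += tid * i
        if i == ckpt_step:
            checkpoint(barrier, tid, private_state, store)


def run_threads(n_threads=4, steps=5, ckpt_step=2):
    """Run the workers and return the saved per-thread states."""
    barrier = threading.Barrier(n_threads)
    store = {}  # assumption: stands in for the on-disk checkpoint
    threads = [
        threading.Thread(target=worker, args=(barrier, t, store, steps, ckpt_step))
        for t in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store
```

The first barrier ensures that no thread is still computing when saving begins, and the second ensures that no thread runs ahead while others are still saving, so the saved states all correspond to the same point in the computation.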