2004
DOI: 10.1145/1037949.1024421

Application-level checkpointing for shared memory programs

Abstract: Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-c…
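The checkpoint-and-restart cycle described in the abstract can be illustrated with a minimal, hand-instrumented sketch. This is not the paper's compiler-generated instrumentation; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `fail_at` fault-injection parameter are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint,
    so a crash during the write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, saving state every `interval` steps.

    `fail_at` is a hypothetical knob that simulates a hardware fault
    by raising an exception when that step is reached.
    """
    state = load_checkpoint(path)  # restart from the last saved state
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware fault")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` a second time with the same checkpoint path after a simulated fault resumes from the last saved step instead of from step 0, which is the behavior the abstract describes.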

Citations: cited by 20 publications (29 citation statements)
References: 25 publications
“…Bronevetsky et al provide a source to source compiler tool that can automatically instruments the code to save and restore its own status. The tool coordinates checkpoints and restarts for parallel OpenMP [18], [19] and MPI programs [20]- [22].…”
Section: Related Work
Type: mentioning
confidence: 99%
“…Blocking coordinated checkpointing with global barrier has been well used in OpenMP programs [20]. However, this checkpointing method has not been used in MPI programs.…”
Section: Related Work
Type: mentioning
confidence: 99%
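The blocking coordinated checkpointing with a global barrier mentioned in the statement above can be sketched with threads standing in for OpenMP workers. This is an illustrative analogue only, not the cited tool: `run_workers` and its parameters are hypothetical names, and the snapshot is kept in memory rather than written to disk.

```python
import copy
import threading


def run_workers(n_threads, n_steps, interval):
    """Each worker updates its own slot of shared state; every `interval`
    steps all workers block at a barrier while one snapshot is taken."""
    partial = [0] * n_threads   # shared state, one slot per worker
    checkpoints = []            # in-memory stand-in for checkpoint files

    def take_checkpoint():
        # Barrier action: executed by exactly one thread, after all workers
        # have arrived and before any is released, so the copy is consistent.
        checkpoints.append(copy.deepcopy(partial))

    barrier = threading.Barrier(n_threads, action=take_checkpoint)

    def worker(tid):
        for step in range(1, n_steps + 1):
            partial[tid] += tid + 1      # this worker's share of the work
            if step % interval == 0:
                barrier.wait()           # block until the snapshot is taken

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return partial, checkpoints
```

Because every worker stops at the same step before the snapshot is copied, each checkpoint reflects a single global point in the computation, at the cost of idling the faster workers until the slowest one arrives.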
“…Thus, client services do not notice that the service has upgraded, except that client services of the new type may see improved performance and fewer rejected requests, and client services of the old type may see decreased performance and more rejected requests. We adopted checkpointing technology [4][5] [10][11] and process migration [12] technology to save the states of service states and recover the states for the new version service.…”
Section: States Management
Type: mentioning
confidence: 99%
“…So the system updating procedure is an atom transaction [9]. The checkpointing technology [4][5] [10][11] and process migration [12] technology are adopted as state saving technology and state recovering.…”
Section: Updating Transactions Of Grid-based System
Type: mentioning
confidence: 99%