2019
DOI: 10.1109/tpds.2018.2866794

CRAFT: A Library for Easier Application-Level Checkpoint/Restart and Automatic Fault Tolerance

Abstract: In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort. This work presents the implementation of our C++ based library CRAFT (Chec…

Cited by 50 publications (48 citation statements)
References 33 publications
“…At various places, measures for improving resilience have been included, based on verifying known properties of computed quantities and on checksums, combined with checkpoint-restart. To simplify incorporating the latter into numerical algorithms, the Checkpoint-Restart and Automatic Fault Tolerance (CRAFT) library has been developed [30]. Figure 3 illustrates its use within the BEAST framework.…”
Section: The Essex-ii Project (mentioning)
confidence: 99%
“…They are often divided into [7]: (i) system-level CR [37], (ii) library-level CR [38], and (iii) application-level CR (ALCR) [7]. ALCR [7], [21] is considered the most efficient, since it leaves the smallest memory footprint [7], [8], [22]; however it requires manual source code modifications for introducing checkpoints into the program.…”
Section: Prior Work (mentioning)
confidence: 99%
“…Application Level Checkpoint and Restart (ALCR) is widely used to enhance the reliability of long-running programs [6]-[8] by periodically saving a copy or checkpoint of the current execution state of software. The most recent copy is then used to restart program execution in case of failure.…”
Section: Introduction (mentioning)
confidence: 99%
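
The checkpoint-and-restart cycle described in the citation statement above can be illustrated with a minimal sketch. This is not the CRAFT API and not any cited author's implementation; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the pickle file format, and the `fail_at` parameter used to simulate a crash are all assumptions made for illustration. The program periodically saves its state, and a restarted run resumes from the most recent saved copy.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` via a temp file plus rename, so a crash
    during the write never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the most recently saved state, or None if none exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing periodically.

    `fail_at` is a hypothetical knob that simulates a hard failure
    just before the given step is executed.
    """
    # Restart path: resume from the latest checkpoint if there is one.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` a second time with the same checkpoint path after a failure repeats only the work done since the last checkpoint. The manual placement of the `save_checkpoint` call inside the loop is the kind of source-code modification that the citing papers describe as the cost of the application-level approach.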
“…The User Level Failure Mitigation (ULFM) [15] corresponds to the most recent effort for the inclusion of resilience capabilities in the MPI standard, enabling applications to detect and react to failures without stopping their execution. Several works have implemented resilient applications using the ULFM features [16]-[22]. ULFM enables the deployment of different recovery strategies after repairing the communication environment when a failure hits the application, thus, avoiding the overheads of re-initializing the entire MPI application.…”
Section: Related Work (mentioning)
confidence: 99%