49th International Conference on Parallel Processing - ICPP 2020
DOI: 10.1145/3404397.3404438

Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method

Abstract: As computers reach exascale and beyond, the incidence of faults will increase. Solutions to this problem are an active research topic. We focus on strategies to make the preconditioned conjugate gradient (PCG) solver resilient against node failures, specifically, the exact state reconstruction (ESR) method, which exploits redundancies in PCG. Reducing the frequency at which redundant information is stored lessens the runtime overhead. However, after the node failure, the solver must restart from the last iterat…

Citations: cited by 5 publications (6 citation statements)
References: 26 publications
“…Pachajoa et al [31] compared exact state reconstruction (ESR) approach based on the method proposed by Chen [7] with the heuristic linear interpolation (LI) approach by Langou et al [28] and Agullo et al [1,2]. They later extended the ESR approach for protecting the PCG method against multiple and simultaneous node failures [32,33]. Altogether, fault-tolerance methods proposed to mitigate the impact of fail-stop errors in iterative applications are application-specific, and can only be applied to a particular class of iterative algorithms.…”
Section: Rr N°9371 (mentioning)
confidence: 99%
“…There is a generic strategy for identifying the state of an iterative linear algebra solver and for reconstructing the state upon recovery [14]. However, like prior work [16,17], we focus on the preconditioned conjugate gradient (PCG) solver, which solves the linear equation 𝐴𝑥 = 𝑏 for a symmetric positive definite matrix 𝐴 𝑛×𝑛 (see Algorithm 1).…”
Section: In-memory ESR and Its Challenges (mentioning)
confidence: 99%
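
For orientation, the solver named in the statement above can be sketched as follows. This is a minimal, single-process illustration of a textbook preconditioned conjugate gradient iteration, not the cited papers' Algorithm 1 and not a fault-tolerant implementation. The Jacobi (diagonal) preconditioner, the function name `pcg`, and the dense list-of-rows matrix format are assumptions chosen to keep the example self-contained.

```python
def pcg(A, b, tol=1e-10, max_iter=1000):
    """Textbook PCG for a dense symmetric positive definite matrix A (list of rows).

    Assumption: a Jacobi preconditioner (inverse of the diagonal of A) is used.
    Returns the approximate solution x and the number of iterations performed.
    """
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / A[i][i] for i in range(n)]

    x = [0.0] * n
    r = list(b)                                   # residual b - A x for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)                                   # initial search direction
    rz = dot(r, z)

    for k in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:               # stop on small residual norm
            return x, k
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)                   # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz                        # direction update coefficient
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Small SPD example: 4x + y = 1, x + 3y = 2
x, iterations = pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

The vectors x, r, z, and p together form the solver state that a resilience scheme would need to recover after a node failure; how that state is distributed and reconstructed is outside the scope of this sketch.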
“…ESRP [17] is a modification of ESR, where redundant copies are created every period to alleviate the networking overhead for each iteration. ESRP demonstrates a trade-off, where increasing the period of ESR decreases the runtime overhead, but increases the cost of discarding the iterations performed since the last storage stage was reached when recovery is required.…”
Section: In-memory ESR and Its Challenges (mentioning)
confidence: 99%
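
The trade-off described in the statement above can be made concrete with a toy count. The sketch below assumes, purely for illustration, that redundant copies are stored at every iteration that is a multiple of the period and that recovery rolls back to the most recent such iteration. The function name `storage_tradeoff` and this simplified storage schedule are hypothetical and do not reproduce the actual ESRP storage stages.

```python
def storage_tradeoff(failure_iter, period):
    """Toy model of periodic redundant storage.

    Assumptions (not from the cited work): copies are stored at iterations
    period, 2*period, ..., and a failure at `failure_iter` rolls the solver
    back to the last stored iteration (or to iteration 0 if none exists).
    Returns (number of storage stages completed, iterations discarded).
    """
    if period < 1 or failure_iter < 0:
        raise ValueError("period must be >= 1 and failure_iter >= 0")
    storage_stages = failure_iter // period
    last_stored = storage_stages * period
    discarded = failure_iter - last_stored
    return storage_stages, discarded


# Failure at iteration 47 under three different periods
for period in (1, 10, 50):
    stages, lost = storage_tradeoff(47, period)
    print(f"period={period:2d}: {stages} storage stages, {lost} iterations discarded")
```

Under these assumptions, a period of 1 stores at every iteration and loses no work, while a longer period stores less often but discards more iterations when a failure occurs.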