Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method

Pachajoa, Carlos; Pacher, Christina; Levonyak, Markus; Gansterer, Wilfried N.

doi:10.1145/3404397.3404438

Cited by 5 publications

(6 citation statements)

References 26 publications


“…Pachajoa et al [31] compared exact state reconstruction (ESR) approach based on the method proposed by Chen [7] with the heuristic linear interpolation (LI) approach by Langou et al [28] and Agullo et al [1,2]. They later extended the ESR approach for protecting the PCG method against multiple and simultaneous node failures [32,33]. Altogether, fault-tolerance methods proposed to mitigate the impact of fail-stop errors in iterative applications are application-specific, and can only be applied to a particular class of iterative algorithms.…”

Section: Rr N°9371 (mentioning)

confidence: 99%

Optimal Checkpointing Strategies for Iterative Applications

Marchal, Pallez, et al. 2022

IEEE Trans. Parallel Distrib. Syst.


This work provides an optimal checkpointing strategy to protect iterative applications from fail-stop errors. We consider a very general framework, where the application repeats the same execution pattern by executing consecutive iterations, and where each iteration is composed of several tasks. These tasks have different execution lengths and different checkpoint costs. Assume that there are $n$ tasks and that task $a_i$, where $0 \le i < n$, has execution time $t_i$ and checkpoint cost $C_i$. A naive strategy would checkpoint after each task. A strategy inspired by the Young/Daly formula would select the task $a_{\min}$ with smallest checkpoint cost $C_{\min}$ and would checkpoint after every $p$-th instance of that task, leading to a checkpointing period $P_{YD} = pT$, where $T = \sum_{i=0}^{n-1} t_i$ is the time per iteration. One would choose the period so that $P_{YD} = pT \approx \sqrt{2 \mu C_{\min}}$ to obey the Young/Daly formula, where $\mu$ is the application MTBF. Both the naive and Young/Daly strategies are suboptimal. Our main contribution is to show that the optimal checkpoint strategy is globally periodic, and to design a dynamic programming algorithm that computes the optimal checkpointing pattern. This pattern may well checkpoint many different tasks, and do so across many different iterations. We show through simulations, both from synthetic and real-life application scenarios, that the optimal strategy significantly outperforms the naive and Young/Daly strategies.
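The Young/Daly-inspired baseline described in this abstract can be sketched in a few lines of Python. This is only an illustration of the suboptimal baseline, not the dynamic programming algorithm the authors propose. The function names, the rounding of p to the nearest integer, and the sample numbers are assumptions made for this example; the time per iteration T is taken as the sum of the task execution times.

```python
import math


def young_daly_period(mtbf, checkpoint_cost):
    """First-order Young/Daly period: sqrt(2 * mu * C)."""
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def young_daly_baseline(task_times, checkpoint_costs, mtbf):
    """Pick the task with the cheapest checkpoint and the number of
    iterations p between checkpoints so that p * T ~ sqrt(2 * mu * C_min).

    Rounding p to the nearest integer (at least 1) is an assumption of
    this sketch; the abstract only states the approximate equality.
    """
    iteration_time = sum(task_times)  # T: time per iteration
    i_min = min(range(len(checkpoint_costs)), key=checkpoint_costs.__getitem__)
    period = young_daly_period(mtbf, checkpoint_costs[i_min])
    p = max(1, round(period / iteration_time))
    return i_min, p


# Hypothetical numbers: three tasks per iteration, times in seconds.
# T = 60, C_min = 2 (task 1), sqrt(2 * 3600 * 2) = 120, so p = 2.
print(young_daly_baseline([10, 20, 30], [5, 2, 8], mtbf=3600))  # (1, 2)
```

With these made-up values the baseline checkpoints after task 1 in every second iteration.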




“…There is a generic strategy for identifying the state of an iterative linear algebra solver and for reconstructing the state upon recovery [14]. However, like prior work [16,17], we focus on the preconditioned conjugate gradient (PCG) solver, which solves the linear equation 𝐴𝑥 = 𝑏 for a symmetric positive definite matrix 𝐴 𝑛×𝑛 (see Algorithm 1).…”

Section: In-memory ESR and Its Challenges (mentioning)

confidence: 99%

“…ESRP [17] is a modification of ESR, where redundant copies are created every period to alleviate the networking overhead for each iteration. ESRP demonstrates a trade-off, where increasing the period of ESR decreases the runtime overhead, but increases the cost of discarding the iterations performed since the last storage stage was reached when recovery is required.…”

Section: In-memory ESR and Its Challenges (mentioning)

confidence: 99%
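The trade-off described in the statement above can be made concrete with a toy model. The sketch below is not the ESRP algorithm itself. It assumes, purely for illustration, that a storage stage completes at every iteration whose index is a multiple of the period, and that recovery rolls back to the most recent such stage. All function names are hypothetical.

```python
def last_storage_stage(fail_iter, period):
    """Most recent iteration at which redundant copies were stored
    (assumption: storage stages fall on multiples of `period`)."""
    return (fail_iter // period) * period


def iterations_discarded(fail_iter, period):
    """Iterations that must be redone after rolling back."""
    return fail_iter - last_storage_stage(fail_iter, period)


def storage_stages(total_iters, period):
    """Number of storage stages, a proxy for the communication overhead."""
    return total_iters // period


# A failure at iteration 99 of a 100-iteration run, for three periods.
for period in (1, 10, 50):
    print(period, storage_stages(100, period), iterations_discarded(99, period))
# 1 100 0
# 10 10 9
# 50 2 49
```

A longer period means fewer storage stages but more discarded iterations when a failure occurs, which is the tension the citing authors point out.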

“…Finally, it is necessary to determine where the state is saved. In traditional systems, the state has to be saved in other cluster nodes [17], so that whenever a node fails, the surviving nodes send the state of the failed node to a spare node (Figure 1a).…”

Section: In-memory ESR and Its Challenges (mentioning)

confidence: 99%

“…Necessary conditions on solvers for the applicability of ESR [14,16,17] are: (1) The iterative algorithm performs a finite-term recurrence, and (2) The iterative algorithm involves a matrix-vector product. In particular, ESR was applied to the Preconditioned Conjugate Gradient (PCG) solver, solving the linear equation 𝐴𝑥 = 𝑏 for a symmetric positive definite matrix 𝐴 𝑛×𝑛 .…”

Section: Introduction (mentioning)

confidence: 99%
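For readers unfamiliar with the solver these statements refer to, the following is a minimal textbook sketch of PCG in plain Python. It is not the code of any of the cited papers and contains no fault tolerance. The Jacobi (diagonal) preconditioner, the dense list-of-lists matrix, and all names are assumptions chosen to keep the example self-contained. The comments mark the two properties the statement above lists as conditions for ESR: a matrix-vector product and a short recurrence.

```python
def matvec(A, v):
    """Dense matrix-vector product A @ v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A.

    Uses a Jacobi preconditioner M = diag(A) (an assumption of this sketch).
    """
    n = len(b)
    m_inv = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)                                   # r = b - A x with x = 0
    z = [mi * ri for mi, ri in zip(m_inv, r)]     # z = M^{-1} r
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        Ap = matvec(A, p)                         # the matrix-vector product
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        z = [mi * ri for mi, ri in zip(m_inv, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        # Short recurrence: the new direction needs only z and the previous p.
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


# Small SPD system; the exact solution is x = (1/11, 7/11).
x = pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Each iteration depends only on the current x, r, z, and p, so the solver state is small. This is the kind of state the quoted passages describe reconstructing after a node failure.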


NVM-ESR: Using Non-Volatile Memory in Exact State Reconstruction of Preconditioned Conjugate Gradient

Yehonatan, Snir, Levin, et al. 2022

Preprint


HPC systems are a critical resource for scientific research and advanced industries. The demand for computational power and memory is increasing and ushers in the exascale era, in which supercomputers are designed to provide enormous computing power to meet these needs. These complex supercomputers consist of many compute nodes and are consequently expected to experience frequent faults and crashes. Exact state reconstruction (ESR) has been proposed as a mechanism to alleviate the impact of frequent failures on long-term computations. ESR has shown great potential in the context of iterative linear algebra solvers, a key building block in numerous scientific applications. Recent designs of supercomputers feature the emerging nonvolatile memory (NVM) technology. For example, the Exascale Aurora supercomputer is planned to integrate Intel Optane™ DCPMM. This work investigates how NVM can be used to improve ESR so that it can scale to future exascale systems such as Aurora and provide enhanced resilience. We propose the non-volatile memory ESR (NVM-ESR) mechanism. NVM-ESR demonstrates how NVM can be utilized in supercomputers for enabling efficient recovery from faults while requiring significantly smaller memory footprint and time overheads in comparison to ESR. We focus on the preconditioned conjugate gradient (PCG) iterative solver also studied in prior ESR research, because it is employed by the representative HPCG scientific benchmark. The source code used by this work, as well as the benchmarks and other relevant sources, are available at: https://github.com/Scientific-Computing-Lab-NRCN/NVM-ESR.git.


Research on the Implementation Method of Parallel Testing Based on Computer Technology

Dong 2023

Lecture Notes in Electrical Engineering

