Recovery Patterns for Iterative Methods in a Parallel Unstable Environment

Langou, Julien; Chen, Zizhong; Bosilca, George; Dongarra, Jack

doi:10.1137/040620394

Cited by 45 publications

(63 citation statements)

References 8 publications

“…However, if simultaneous errors impact a single relation we have two possible scenarios: 1) Simultaneous errors in a single vector are not a problem for our recovery strategy. This is trivial for vectors recovered from linear relations, and straightforward for submatrix relations [29]. For two failed blocks i and j, we can combine both block relations:…”

Section: Dealing With Multiple Errors (mentioning)

confidence: 99%

“…Also, very aggressive resilience strategies like process triplication are completely impractical unless we face very high fault rates [18]. Therefore, intermediate solutions that recompute an approximation of the lost data [29] or that save the process state in a checkpoint with a certain frequency have been extensively used [12], [32], [37]. However, most of these solutions involve backward recoveries, discarding useful computations, and thus incur significant slowdowns.…”

Section: Introduction (mentioning)

confidence: 99%

“…The application itself may be able to handle the error and terminate cleanly [5] or perform some sort of recovery procedure relying on Algorithmic-Based Fault Tolerance (ABFT), which has been extensively applied to MPI programs [10], [17], [29], as well as shared memory programming models [38], [40]. Algorithmic approaches have demonstrated to be more efficient than backward recoveries like checkpointing-rollback.…”

Section: Introduction (mentioning)

confidence: 99%

“…• A mathematical proof showing that Langou et al's Lossy Approach [29] is the best ABFT recovery strategy of all the restart techniques in the literature.…”

Section: Introduction (mentioning)

confidence: 99%

Asynchronous and Exact Forward Recovery for Detected Errors in Iterative Solvers

Jaulmes, Moretó, Ayguadé et al. 2018

IEEE Trans. Parallel Distrib. Syst.

Abstract: Current trends and projections show that faults in computer systems become increasingly common. Such errors may be detected, and possibly corrected transparently, e.g. by Error Correcting Codes (ECC). For a program to be fault-tolerant, it needs to also handle the Errors that are Detected and Uncorrected (DUE), such as an ECC encountering too many bit flips in a codeword. While correcting an error has an overhead in itself, it can also affect the progress of a program. The most generic technique, rolling back the program state to a previously taken checkpoint, sets back any progress done since then. Alternately, application specific techniques exist, such as restarting an iterative program with its latest iteration's values as initial guess. We introduce a novel error correction technique for iterative linear solvers, designed to preserve both the progress made and the solver's future convergence by recovering the program's state exactly. Leveraging the asynchrony of task-based programming models, we mask our technique's overhead by overlapping error correction with the solver's normal workload. Our technique relies on analysing solvers to find redundancy in the form of relations between data. We are then able to restore discarded or corrupted data by recomputing or inverting the appropriate relations. We demonstrate that this approach allows to recover any part of three widely used Krylov subspace methods: CG, GMRES and BiCGStab, and their pre-conditioned versions. We implement our technique for CG and recover lost data at the scale of a memory page, which is the granularity at which Operating Systems (OS) report memory errors on commodity hardware, and study the effect of varying the memory page size to address non-standard sizes and the possible use of huge pages in High Performance Computing (HPC).
When compared to checkpointing and to the state-of-the-art algorithmic restart technique, on small (8 cores) to large scale (1024 cores), our methods show less overhead. A trade-off arises between our straightforward and asynchronous approaches, based on the rate at which faults happen. At the lowest considered rate and page size, overlapping recoveries decreases their average cost from 5.40% to 2.24% of the ideal faultless execution time. Our methods generally outperform the state-of-the-art even with increased overheads on big page sizes, and perform similarly on edge cases. These results also indicate that our techniques are increasingly efficient as the matrix size increases.
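The idea of restoring lost data by "inverting the appropriate relations" can be illustrated with a minimal sketch. It assumes, purely for illustration, that the relation being inverted is the residual definition g = b - A x, and that a contiguous block of the iterate x has been lost while A, b, g and the rest of x survive. The function names (`solve_dense`, `recover_block`), the variable names and the 4 × 4 test matrix are hypothetical and are not taken from the cited paper.

```python
def solve_dense(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's data is left untouched.
    aug = [list(M[r]) + [rhs[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (aug[r][n] - s) / aug[r][r]
    return y


def recover_block(A, b, g, x, lost):
    """Rebuild the entries of x listed in `lost` from the relation g = b - A x."""
    kept = [j for j in range(len(x)) if j not in lost]
    # Restricting the relation to the lost rows gives a small local system:
    # A[lost, lost] x[lost] = b[lost] - g[lost] - A[lost, kept] x[kept]
    rhs = [b[i] - g[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    block = [[A[i][j] for j in lost] for i in lost]
    recovered = solve_dense(block, rhs)
    repaired = list(x)
    for i, value in zip(lost, recovered):
        repaired[i] = value
    return repaired


# Hypothetical symmetric positive definite test problem.
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
x_true = [0.5, -0.25, 1.0, 2.0]  # stands in for some intermediate iterate
g = [b[i] - sum(A[i][j] * x_true[j] for j in range(4)) for i in range(4)]

x_damaged = list(x_true)
x_damaged[1] = x_damaged[2] = float("nan")  # simulate a lost memory page
x_repaired = recover_block(A, b, g, x_damaged, lost=[1, 2])
max_error = max(abs(u - v) for u, v in zip(x_repaired, x_true))
```

For a symmetric positive definite A, every principal submatrix is itself positive definite, so the local system in `recover_block` is always solvable. A real solver would factor the diagonal block with a sparse or Cholesky routine rather than the dense elimination used in this sketch.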

“…In a distributed environment the major cost of this method comes from obtaining the consistent snapshots and disk access to write the snapshots, which highlights the major drawback of such approaches, the relatively high overhead. Langou and Dongarra [36] investigated several checkpoint/recovery techniques and a checkpoint-free lossy fault tolerant technique for parallel iterative methods. Robert and Vivien [10,12] presented a unified model for several common checkpoint/restart protocols, extended in [16] to cover process replication.…”

Section: Related Work (mentioning)

confidence: 99%

Parallel reduction to Hessenberg form with algorithm-based fault tolerance

Jia, Bosilca, Łuszczek et al. 2013

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Resilience is the faculty of a computing system or software stack to provide the expected outcome regardless of system degradations and failures. This paper studies the resilience of two-sided factorizations and presents a generic algorithm-based approach capable of rendering two-sided factorizations resilient. We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the proceeding matrix (block columns to the left of the current panel scope) with checksums, and protect finished panels in the panel scope (Q block columns containing the current panel) with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine) our fault-tolerant algorithm introduces very little overhead, and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases. Our proof, corroborated by experiments on a Cray XT6 platform, shows that the overhead of our algorithm reaches as low as 1.8% compared to the non-fault tolerant algorithm at matrix size 96000 × 96000 on a 96 × 96 process grid. To our knowledge our work is the first to investigate and implement resilient algorithms for the HR.

Application health monitoring for extreme‐scale resiliency using cooperative fault management

Agarwal, Naughton, Park et al. 2019

Concurrency and Computation

Summary: Resiliency is and will be a critical factor in determining scientific productivity on current and exascale supercomputers, and beyond. Applications oblivious to and incapable of handling transient soft and hard errors could waste supercomputing resources or, worse, yield misleading scientific insights. We introduce a novel application‐driven silent error detection and recovery strategy based on application health monitoring. Our methodology uses application output that follows known patterns, as indicators of an application's health and knowledge that violation of these patterns could be indication of faults. Information from system monitors that report hardware and software health status is used to corroborate faults. Collectively, this information is used by a fault coordinator agent to take preventive and corrective measures by applying computational steering to an application between checkpoints. This cooperative fault management system uses the Fault Tolerance Backplane as a communication channel. The benefits of this framework are demonstrated with two real application case studies, molecular dynamics, and quantum chemistry simulations, on scalable clusters with simulated memory and I/O corruptions. The developed approach is general and can be easily applied to other applications.
