Extending the scope of the Checkpoint‐on‐Failure protocol for forward recovery in standard MPI

With the advent and widespread use of web services, complex business processes are being built by discovering and composing services already available over the Internet. Such composite web services operate over the Internet, where reliability and speed cannot be guaranteed; hence providing fault-handling mechanisms for composite web services is of primary importance. Several fault handling and recovery techniques have been proposed in the literature in the context of web services. Backward recovery using checkpointing is a general recovery scheme that can be used to make web services resilient to faults. Checkpointing strategies proposed in distributed systems are not directly applicable to web services due to the differences in the two paradigms. Few papers have been published discussing the need and techniques for checkpointing web services. In this paper we present a survey on various distributed and web service checkpointing techniques, discussing their applicability, strengths and weaknesses. We give a brief introduction to our approach of checkpointing web services, which identifies checkpointing locations, without user intervention, using the complexity of interactions and services offered.

“…Few papers [13][14][15][16][17][18] have been published discussing the need and techniques for checkpointing web services.…”

Section: Checkpointing In Web Services: Literature Survey (mentioning)

confidence: 99%

A survey on checkpointing web services

Vathsala

Mohanty

2014

Proceedings of the 6th International Workshop on Principles of Engineering Service-Oriented and Cloud Systems

“…Yao and Wang [52] proposed a nonstop algorithm based fault tolerant scheme to recover the solution vector from fail-stop process failures in HPL 2.0. Bland et al. [7,8] proposed a Checkpoint-on-Failure protocol for fault recovery in dense linear algebra.…”

Section: Related Work (mentioning)

confidence: 99%

Parallel reduction to hessenberg form with algorithm-based fault tolerance

Jia

Bosilca

Łuszczek

et al. 2013

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Self Cite

Resilience is the faculty of a computing system or software stack to provide the expected outcome regardless of system degradations and failures. This paper studies the resilience of two-sided factorizations and presents a generic algorithm-based approach capable of rendering two-sided factorizations resilient. We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the proceeding matrix (block columns to the left of the current panel scope) with checksums, and protect finished panels in the panel scope (Q block columns containing the current panel) with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine) our fault-tolerant algorithm introduces very little overhead, and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases. Our proof, corroborated by experiments on a Cray XT6 platform, shows that the overhead of our algorithm reaches as low as 1.8% compared to the non-fault tolerant algorithm at matrix size 96000 × 96000 on a 96 × 96 process grid. To our knowledge our work is the first to investigate and implement resilient algorithms for the HR.

“…The application itself may be able to handle the error and terminate cleanly [5] or perform some sort of recovery procedure relying on Algorithmic-Based Fault Tolerance (ABFT), which has been extensively applied to MPI programs [10], [17], [29], as well as shared memory programming models [38], [40]. Algorithmic approaches have demonstrated to be more efficient than backward recoveries like checkpointing-rollback.…”

Section: Introduction (mentioning)

confidence: 99%

Asynchronous and Exact Forward Recovery for Detected Errors in Iterative Solvers

Jaulmes

Moretó

Ayguadé

et al. 2018

IEEE Trans. Parallel Distrib. Syst.

Current trends and projections show that faults in computer systems become increasingly common. Such errors may be detected, and possibly corrected transparently, e.g. by Error Correcting Codes (ECC). For a program to be fault-tolerant, it needs to also handle the Errors that are Detected and Uncorrected (DUE), such as an ECC encountering too many bit flips in a codeword. While correcting an error has an overhead in itself, it can also affect the progress of a program. The most generic technique, rolling back the program state to a previously taken checkpoint, sets back any progress done since then. Alternately, application specific techniques exist, such as restarting an iterative program with its latest iteration's values as initial guess. We introduce a novel error correction technique for iterative linear solvers, designed to preserve both the progress made and the solver's future convergence by recovering the program's state exactly. Leveraging the asynchrony of task-based programming models, we mask our technique's overhead by overlapping error correction with the solver's normal workload. Our technique relies on analysing solvers to find redundancy in the form of relations between data. We are then able to restore discarded or corrupted data by recomputing or inverting the appropriate relations. We demonstrate that this approach allows to recover any part of three widely used Krylov subspace methods: CG, GMRES and BiCGStab, and their pre-conditioned versions. We implement our technique for CG and recover lost data at the scale of a memory page, which is the granularity at which Operating Systems (OS) report memory errors on commodity hardware, and study the effect of varying the memory page size to address non-standard sizes and the possible use of huge pages in High Performance Computing (HPC).
When compared to checkpointing and to the state-of-the-art algorithmic restart technique, on small (8 cores) to large scale (1024 cores), our methods show less overhead. A trade-off arises between our straightforward and asynchronous approaches, based on the rate at which faults happen. At the lowest considered rate and page size, overlapping recoveries decreases their average cost from 5.40% to 2.24% of the ideal faultless execution time. Our methods generally outperform the state-of-the-art even with increased overheads on big page sizes, and perform similarly on edge cases. These results also indicate that our techniques are increasingly efficient as the matrix size increases.

Cited by 17 publications

References 21 publications
