A Scalable Communication-Induced Checkpointing Algorithm for Distributed Systems

Simon, A.; Hernández, Saúl E. Pomares; Cruz, José Roberto Pérez; Gómez-Gil, Pilar; Drira, Khalil

doi:10.1587/transinf.e96.d.886

Cited by 4 publications (1 citation statement)

References 10 publications (19 reference statements)

“…An optimized version of FINE, called LazyFINE, applies a lazy strategy using the work of Lou and Manivannan [21,22]. Finally, Simon et al [9,12,23] propose another FI variant, which addresses system scalability, aimed for large-scale systems. Simon et al reduce the number of forced checkpoints by delaying non-forced checkpoints.…”

Section: Related Work (mentioning)

confidence: 99%

Autonomic Web Services Based on Different Adaptive Quasi-Asynchronous Checkpointing Techniques

et al. 2020

Self Cite

Companies, organizations and individuals use Web services to build complex business functionalities. Web services must operate properly over the unreliable Internet infrastructure, even in the presence of failures. To increase system dependability, organizations, including service providers, adapt their systems to the autonomic computing paradigm. Strategies vary from providing one to all of the S-CHOP features (self-configuration, self-healing, self-optimization and self-protection). For self-healing, a closely related tool is communication-induced checkpointing (CiC), in which a checkpoint contains the state (heap, registers, stack, kernel state) of each process in the system. CiC is based on quasi-synchronous checkpointing, where processes take checkpoints relying on control information piggybacked on application messages. To avoid dangerous patterns such as Z-paths and Z-cycles, the system takes forced checkpoints and thereby avoids inconsistent states. Unlike other tools, CiC does not degrade system performance; our proposal does not incur high overhead (as the results show), and it has the advantage of being scalable. As we have shown in previous work, CiC can be used to address dependability problems in Web services, because CiC mechanisms work in a distributed and efficient manner. Therefore, in this work we propose an adaptable and dynamic generation of checkpoints to support fault tolerance. We present an alternative that considers Quality of Service (QoS) criteria and the different impact that applications have on them. We propose taking checkpoints dynamically in case of failure or QoS degradation. Experimental results show that our approach generates significantly fewer checkpoints than various well-known tools in the literature.
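To make the piggybacking and forced-checkpoint idea concrete, the following is a minimal sketch of the classic index-based CiC rule (commonly attributed to Briatico, Ciuffoletti and Simoncini). It is not the algorithm of Simon et al. nor the adaptive scheme of the abstract above, and the names `Message`, `Process`, `basic_checkpoint`, `send` and `receive` are hypothetical, chosen only for illustration. Each process piggybacks its current checkpoint index on every message; a receiver whose index is lower takes a forced checkpoint before delivering the message.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    index: int  # control information piggybacked on the application message


@dataclass
class Process:
    name: str
    index: int = 0  # index of the latest local checkpoint
    checkpoints: list = field(default_factory=list)
    delivered: list = field(default_factory=list)

    def __post_init__(self):
        self._checkpoint("initial")

    def _checkpoint(self, kind):
        # Assumption: a real system would save heap, registers, stack, etc.
        # This sketch only records (index, kind).
        self.checkpoints.append((self.index, kind))

    def basic_checkpoint(self):
        # Checkpoint taken on the process's own initiative.
        self.index += 1
        self._checkpoint("basic")

    def send(self, payload):
        # Piggyback the current index on the outgoing message.
        return Message(payload, self.index)

    def receive(self, msg):
        # Forced checkpoint: the sender is ahead, so catch up before delivery.
        if msg.index > self.index:
            self.index = msg.index
            self._checkpoint("forced")
        self.delivered.append(msg.payload)


p, q = Process("p"), Process("q")
p.basic_checkpoint()        # p moves to index 1
q.receive(p.send("m1"))     # q is behind, so it takes a forced checkpoint
q.receive(p.send("m2"))     # indexes are equal, no new checkpoint
print(q.checkpoints)
```

Under this rule, a message is never sent after a checkpoint with a given index and delivered before the receiver's checkpoint with that same index, which is what keeps same-index checkpoints mutually consistent. The trade-off is that every basic checkpoint can trigger forced checkpoints at other processes, which is the overhead that lazier variants try to reduce.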

An efficient validation approach for quasi-synchronous checkpointing oriented to distributed diagnosability

Khlif, Kacem, Hernández, et al. 2016

Journal of Systems and Software

The Autonomic Computing paradigm is oriented towards enabling complex distributed systems to manage themselves, even in faulty situations. Diagnosability analysis is an a priori study through which a system can become aware of its current state. Only after determining a consistent state can a system take actions to repair or reconfigure itself. Nevertheless, in a distributed system it is hard to determine consistent states, since the local variables of different processes cannot all be observed simultaneously. In this context, the challenge is to efficiently monitor the system execution over time and capture trace information in order to determine whether the system meets both functional and non-functional requirements. Quasi-Synchronous Checkpointing is a technique that collects information from which a system can establish consistent snapshots. Based on this technique, several checkpointing algorithms have been developed. According to their checkpoint properties, they are classified as Strictly Z-Path Free (SZPF), Z-Path Free (ZPF) or Z-Cycle Free (ZCF). Checkpointing algorithms are often evaluated with regard to performance, generally through simulation. However, their correctness has received little study. In this paper, we propose an efficient validation approach based on a graph transformation oriented towards the automatic detection of the aforementioned properties.

A Validation Approach for Quasi-Synchronous Checkpointing Algorithms in HPC Systems

Khlif, Kacem, Hernández, et al. 2017

2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA)
