2008 Third International Conference on Availability, Reliability and Security 2008
DOI: 10.1109/ares.2008.171

A Framework for Proactive Fault Tolerance

Abstract: Fault tolerance is a major concern in guaranteeing the availability of critical services and of application execution. Traditional approaches to fault tolerance include checkpoint/restart and duplication. However, it is also possible to anticipate failures and proactively take action before they occur, minimizing their impact on the system and on application execution. This document presents a proactive fault tolerance framework. This framework can use different proactive fault tolerance mechanisms, i.e…

Cited by 54 publications
(21 citation statements)
References 4 publications
“…Based on when a response is initiated with respect to the occurrence of the failure, approaches can be classified as proactive and reactive. Proactive approaches predict failures of computing resources before they occur and then relocate a job executing on resources anticipated to fail onto resources that are not predicted to fail (for example [32,43,44]). The control of a fault tolerant approach can be either centralised or distributed. In approaches where the control is centralised, one or more servers are used for backup and a single process is responsible for monitoring jobs that are executed on a network of nodes.…”
Section: Discussion (mentioning)
confidence: 99%
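
The proactive approach described in the statement above (predict which resources will fail, then relocate their jobs to resources not predicted to fail) can be illustrated with a minimal sketch. This is not the cited framework's implementation: the `Node` class, the `failure_risk` score, the `threshold` value, and the least-loaded placement rule are all assumptions made for illustration, and a real system would obtain the risk score from a failure predictor and perform an actual process or virtual machine migration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A compute node (hypothetical model, not from the cited paper)."""
    name: str
    failure_risk: float  # assumed predictor output in [0, 1]
    jobs: list = field(default_factory=list)


def relocate_at_risk_jobs(nodes, threshold=0.8):
    """Move jobs off nodes whose predicted failure risk reaches the threshold.

    Returns a list of (job, source_name, destination_name) tuples.
    Jobs stay in place if no node is below the threshold.
    """
    healthy = [n for n in nodes if n.failure_risk < threshold]
    moves = []
    for node in nodes:
        if node.failure_risk < threshold:
            continue  # node is not predicted to fail
        while node.jobs and healthy:
            # Assumed placement policy: the least-loaded healthy node.
            target = min(healthy, key=lambda n: len(n.jobs))
            job = node.jobs.pop(0)
            target.jobs.append(job)
            moves.append((job, node.name, target.name))
    return moves
```

Calling `relocate_at_risk_jobs(nodes)` before a predicted failure empties the at-risk nodes; the returned list records what was moved, which a monitoring process could log.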
“…It is not desirable to have to restart a job from the beginning if it has been executing for hours or days or months [6]. A key challenge in maintaining the seamless (or near seamless) execution of such jobs in the event of failures is addressed under research in fault tolerance [7,8,9,10]. Many jobs rely on fault tolerant approaches that are implemented in the middleware supporting the job (for example [6,11,12,13]). The conventional fault tolerant mechanism supported by the middleware is checkpointing [14,15,16,17], which involves the periodic recording of intermediate states of execution of a job to which execution can be returned if a fault occurs.…”
mentioning
confidence: 99%
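
Checkpointing as described in the statement above (periodically record an intermediate state, and return to it after a fault) can be sketched as follows. This is an illustrative toy, not middleware from any cited work: the function names, the JSON file format, the `interval` parameter, and the `fail_at` fault injection are all assumptions. The job here simply sums step indices so that a resumed run can be checked against an uninterrupted one.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Record the state, writing to a temp file first so a crash
    mid-write cannot corrupt the previous checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename over the old checkpoint


def load_checkpoint(path):
    """Return the last recorded state, or a fresh one if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run_job(path, n_steps, interval=10, fail_at=None):
    """Run (or resume) the job, checkpointing every `interval` steps.

    `fail_at` is a hypothetical hook that simulates a fault at that step.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated fault")
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

After a fault, calling `run_job` again with the same `path` resumes from the most recent checkpoint rather than from step 0, so only the work done since that checkpoint is repeated.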
“…These two facets are integrated in approaches that combine prediction and migration in proactive FT systems and evaluate different FT policies. In [50], the authors provide a generic framework based on a modular architecture allowing the implementation of new proactive fault tolerance policies/mechanisms. An agent-oriented framework [23] was developed for grid computing environments with separate agents to monitor individual classes or subclasses of faults and proactively act to avoid or tolerate a fault.…”
Section: Related Work (mentioning)
confidence: 99%
“…Application reallocation was performed using a load balancer. Another Type 1 prototype [12] investigated coordination, protocols, and interfaces between individual system components.…”
Section: Proactive Fault Tolerance Framework (mentioning)
confidence: 99%
“…Other initial work focused on a proactive FT framework [12], which combines both to perform prediction-triggered migration. However, evaluation and comparison of individual solutions is very difficult at this early research stage due to missing realistic architectural models for the deployment of proactive FT technology in extreme-scale HPC systems.…”
Section: Introduction (mentioning)
confidence: 99%