System structure for software fault tolerance

Randell, Brian

doi:10.1145/390016.808467

Cited by 745 publications

(269 citation statements)

References 3 publications

Mentioning: 265


“…When saving the state of the communication links, the Domino Effect [18] has to be avoided. Because of the high dynamicity of agent systems, independent checkpointing techniques are beneficial against coordinated checkpointing algorithms.…”

Section: Discussion Of Fault Tolerance Methods (mentioning)

confidence: 99%

FANTOMAS Fault Tolerance for Mobile Agents in Clusters

Pals

Petri

Grewe

2000

Lecture Notes in Computer Science


Abstract. To achieve an efficient utilization of cluster systems, a proper programming and operating environment is required. In this context, mobile agents are of growing interest as a basis for distributed and parallel applications. As mobile and autonomous software units, mobile agents can execute tasks given to the system and independently allocate all the needed resources. However, with growing cluster sizes, the probability of a failure of one or more system components, and with it the loss of mobile agents, rises. While fault tolerance issues for applications based on "traditional" processes have been extensively studied, current agent environments provide only insufficient extensions, if any, for reacting capably to such failures. We examine fault tolerance with regard to the properties and requirements of mobile agents, and find that independent checkpointing with receiver-based message logging is appropriate in this context. We derive the FANTOMAS (Fault-Tolerant approach for Mobile Agents) design, which offers user-transparent fault tolerance that can be activated on request, according to the needs of the task. A theoretical analysis examines the advantages and drawbacks of FANTOMAS.



“…Error isolation and containment by using virtual memory protection has also been studied for device drivers [4]. Multi-version techniques include recovery blocks [47] and N-version software [48].…”

Section: Related Work (mentioning)

confidence: 99%

Exception Handling in the Choices Operating System

David

Carlyle

Chan

et al. 2006

Advanced Topics in Exception Handling Techniques


Abstract. Exception handling is a powerful abstraction that can be used to help manage errors and support the construction of reliable operating systems. Using exceptions to notify system components about exceptional conditions also reduces coupling of error handling code and increases the modularity of the system. We explore the benefits of incorporating exception handling into the Choices operating system in order to improve reliability. We extend the set of exceptional error conditions in the kernel to include critical kernel errors such as invalid memory access and undefined instructions by wrapping them with language-based software exceptions. This allows developers to handle both hardware and software exceptions in a simple and unified manner through the use of an exception hierarchy. We also describe a catch-rethrow approach for exception propagation across protection domains. When an exception is caught by the system, generic recovery techniques like policy-driven micro-reboots and restartable processes are applied, thus increasing the reliability of the system.
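The recovery blocks named in the citation statement above are the scheme introduced by the cited Randell paper: a primary routine and one or more alternates are tried in turn, each starting from the state saved on entry, until one passes an acceptance test. The following is only a minimal sketch of that idea, not code from either paper. The names `recovery_block`, `lossy_sort`, `plain_sort`, and `make_acceptance_test` are hypothetical, and a deep copy of a dictionary stands in for a real recovery point.

```python
import copy
from typing import Any, Callable, Dict, Sequence

State = Dict[str, Any]


class RecoveryBlockFailure(Exception):
    """Raised when every alternate fails its acceptance test."""


def recovery_block(
    state: State,
    alternates: Sequence[Callable[[State], State]],
    acceptance_test: Callable[[State], bool],
) -> State:
    """Try each alternate in order; return the first accepted result."""
    for alternate in alternates:
        # Recovery point (assumed to be a deep copy here): each alternate
        # works on a private copy, so discarding the copy restores the
        # state held on entry to the block.
        working = copy.deepcopy(state)
        try:
            candidate = alternate(working)
        except Exception:
            continue  # a raised error counts as a failed alternate
        if acceptance_test(candidate):
            return candidate
    raise RecoveryBlockFailure("no alternate passed the acceptance test")


# Hypothetical example: the primary alternate loses an element.
def lossy_sort(s: State) -> State:
    s["items"] = sorted(s["items"])[:-1]
    return s


def plain_sort(s: State) -> State:
    s["items"] = sorted(s["items"])
    return s


def make_acceptance_test(original: State) -> Callable[[State], bool]:
    expected_length = len(original["items"])

    def accept(s: State) -> bool:
        items = s["items"]
        in_order = all(a <= b for a, b in zip(items, items[1:]))
        return in_order and len(items) == expected_length

    return accept
```

Calling `recovery_block(start, [lossy_sort, plain_sort], make_acceptance_test(start))` rejects the output of `lossy_sort` because an element is missing, then accepts the output of `plain_sort`. The caller's `start` dictionary is never modified, which is what plays the role of rollback in this sketch.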


“…The results produced by the processors involved in the execution of the same task are collected by the Error Management (EM) component, which selects the result to be delivered by applying an adjudication function, and either forwards it to the users or stores it in a stable storage to be used in subsequent computations. When dynamic error processing mechanisms are employed [2,10], redundant execution of an applicative task might be performed in phases, where the execution of further copies of the application is conditional on the absence of an adjudged result in the current phase, as notified by the EM; this implies information exchange between EM and the Planner. EM provides also information to another component, the Diagnosis Mechanism (DM): for each redundant task execution EM delivers to DM a notification about the processor(s) that originated disagreeing results with respect to the adjudicated output.…”

Section: System Model (mentioning)

confidence: 99%

Evaluation of Fault-Tolerant Multiprocessor Systems for High Assurance Applications

Grandoni

2001

The Computer Journal


In designing high assurance systems, the dependability goals are achieved through the adoption of several fault tolerance techniques. Unfortunately, their combined effect on the system cannot, in the general case, be derived by straightforward composition of the stand-alone analyses of the components, because of the mutual dependence of their controlling parameters. In this paper the assessment of overall system dependability induced by such integrated fault tolerance organizations is carried out through a stochastic simulation approach. To this purpose, a few fault tolerant multiprocessor architectures, based on the integrated usage of standard error processing structures with a recently proposed diagnostic mechanism, called α-count, are selected and evaluated. The diagnostic mechanism gets its input (error signals) from the error processing mechanism, whose behaviour is in turn influenced by the rapidity and correctness with which α-count identifies permanently/intermittently faulty processors. The choice of the basic fault tolerance mechanisms to adopt, as well as the reference system architecture, has been driven by the characteristics of the envisaged target applications: mainly, stringent dependability requirements, to be traded with adequate levels of performance and cost. The analysis has been focused on performability, which is an appropriate measure to evaluate whether a certain design is "better" than another from both the dependability and the performance points of view.
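The phased execution described in the citation statement above, where further copies of a task run only if the current phase yields no adjudged result, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's design: a quorum-based majority vote stands in for the adjudication function, processors are modelled as plain callables, results are assumed to be hashable, and the names `adjudicate` and `run_in_phases` are hypothetical.

```python
from collections import Counter
from itertools import islice
from typing import Callable, Dict, Hashable, List, Optional, Sequence, Tuple

Processor = Tuple[str, Callable[[int], Hashable]]


def adjudicate(results: Dict[str, Hashable], quorum: int) -> Optional[Hashable]:
    """Return the most common result if at least `quorum` copies agree."""
    if not results:
        return None
    value, count = Counter(results.values()).most_common(1)[0]
    return value if count >= quorum else None


def run_in_phases(
    task: int,
    processors: Sequence[Processor],
    phase_sizes: Sequence[int],
    quorum: int,
) -> Tuple[Optional[Hashable], List[str]]:
    """Run copies of `task` phase by phase until a result is adjudged.

    Returns the adjudged result (or None) and the ids of the processors
    whose results disagree with it, i.e. the notification that would be
    passed on to the diagnosis mechanism.
    """
    results: Dict[str, Hashable] = {}
    remaining = iter(processors)
    for size in phase_sizes:
        # Execute the next `size` copies; later phases are reached only
        # when the earlier ones produced no adjudged result.
        for pid, run in islice(remaining, size):
            results[pid] = run(task)
        adjudged = adjudicate(results, quorum)
        if adjudged is not None:
            disagreeing = sorted(p for p, v in results.items() if v != adjudged)
            return adjudged, disagreeing
    return None, sorted(results)
```

With two copies in the first phase and one in the second, a third processor is used only when the first two disagree, which trades a longer worst-case latency for fewer executions in the common fault-free case.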

