Computer systems for critical applications must be designed to tolerate software faults as well as hardware faults. A unified approach to tolerating hardware and software faults is characterized by classifying faults in terms of duration (transient or permanent) rather than source (hardware or software). Errors arising from transient faults can be handled through masking or voting, but errors arising from permanent faults require system reconfiguration to bypass the failed component. Most errors caused by software faults can be considered transient, in that they are input dependent. Quantitative dependability analysis of systems which exhibit a unified approach to fault tolerance can be performed by a hierarchical combination of fault tree and Markov models. In this paper, a methodology for analyzing hardware and software fault tolerant systems is applied to the analysis of a hypothetical system, loosely based on the Fault Tolerant Parallel Processor (FTPP) [7]. The model considers both transient and permanent faults, hardware and software faults, unrelated and related software faults, and automatic recovery and reconfiguration. The parameter values for the software part of the model are determined from an experimental implementation of an N-version programming application. The parameter values chosen for the hardware part of the model are considered fairly typical.
... of life, or severe economic or environmental damage. In order to meet stringent dependability requirements, fault tolerant computer systems often employ similar and dissimilar redundancy and complex recovery mechanisms to tolerate hardware and software faults [10]. Dependability analysis of critical systems often requires a hierarchical approach combining several different modeling techniques. In this paper, we demonstrate such a hierarchical approach using Markov models, fault trees and several combinatorial equations to analyze a hypothetical fault tolerant system.
We begin with a description of the system to be analyzed, and then proceed to develop the model incrementally. The resulting model is a combination of fault tree models for the analysis of software fault tolerance, a Markov model for the analysis of hardware fault tolerance, and several combinatorial equations to combine the analyses.
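To make the idea of a hierarchical combination concrete, the following is a minimal sketch and not the model developed in this paper. It assumes a three-version software component with majority voting (evaluated as a small fault tree), a hardware triad subject only to permanent faults with no repair or reconfiguration (evaluated as a three-state Markov chain in closed form), and statistical independence between the two. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import math


def nvp_failure_prob(p_version, p_related):
    """Fault-tree style estimate for three versions with majority voting.

    p_version: assumed per-version failure probability on a given input
               (unrelated faults, treated as independent).
    p_related: assumed probability of a related fault defeating the vote.
    """
    # Basic event combination: at least two of three versions fail together.
    p_two_or_more = 3 * p_version**2 * (1 - p_version) + p_version**3
    # Top event: related fault OR coincident unrelated faults.
    return 1 - (1 - p_related) * (1 - p_two_or_more)


def triad_failure_prob(lam, t):
    """Three-state Markov chain for a hardware triad, solved in closed form.

    States: 3 up -> 2 up (rate 3*lam) -> failed (rate 2*lam).
    Only permanent faults are modeled; transients and recovery are omitted.
    """
    p3 = math.exp(-3 * lam * t)                                  # all three up
    p2 = 3 * (math.exp(-2 * lam * t) - math.exp(-3 * lam * t))   # two up
    return 1 - p3 - p2                                           # failed state


def combine(p_hw, p_sw):
    """Combinatorial step: the system fails if either part fails."""
    return 1 - (1 - p_hw) * (1 - p_sw)


if __name__ == "__main__":
    # Hypothetical parameter values, not taken from the paper.
    p_sw = nvp_failure_prob(p_version=1e-3, p_related=1e-6)
    p_hw = triad_failure_prob(lam=1e-4, t=10.0)
    print("software:", p_sw, "hardware:", p_hw, "system:", combine(p_hw, p_sw))
```

The point of the sketch is only the layering: each submodel is solved with the technique that suits it, and the results are then joined by a simple combinatorial equation.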
Example System Description - FTPP cluster
The example system to be analyzed in this paper is a hypothetical FTPP [7] cluster which is designed to tolerate both hardware and software faults. The cluster consists of sixteen processing elements (PE), with four connected to each of four network elements (NE). The network elements are fully connected, and form a Byzantine Resilient core for the cluster. Four of the processors (one on each NE for Byzantine resilience), those labeled Q1, Q2, Q3 and Q4, form a