Critical infrastructure applications provide services upon which society depends heavily; such applications require constant, dependable operation in the face of various failures, natural disasters, and other disruptive events that might cause a loss of service. These applications are themselves dependent on distributed information systems for all aspects of their operation, so survivability of these critical information systems is an important issue. Survivability is the ability of a system to continue to provide service, though possibly alternate or degraded, in the face of various types of failure and disruption. A fundamental mechanism by which survivability can be achieved in critical information systems is fault tolerance. Much of the literature on fault-tolerant distributed systems focuses on tolerance of local faults by detecting and masking the effects of those faults. I describe a direction for fault tolerance in the face of non-local faults: faults whose effects have significant non-local impact, sometimes widespread and sometimes catastrophic, and whose effects often cannot be masked using available resources.
The goal is to recognize these non-local faults through detection and analysis, then to provide continued service (possibly alternate or degraded) by reconfiguring the system in response to these faults. A specification-based approach to fault tolerance, called RAPTOR, is presented that enables systematic structuring of formal specifications for error detection and recovery, utilizes a translator to synthesize portions of the implementation from the formal specifications, and provides an implementation architecture supporting fault-tolerance activities. The RAPTOR approach consists of three specifications describing the fault-tolerant system, the errors to be detected, and the actions to take to recover from those errors. The RAPTOR System includes a synthesizer, the Fault Tolerance Translator, which generates code components from the specifications to perform error detection and recovery activities. In addition, a novel implementation architecture incorporates the generated code as part of an infrastructure supporting fault tolerance at both the node and syst...