This paper describes an approach to providing software fault tolerance for future deep-space robotic National Aeronautics and Space Administration missions, which will require a high degree of autonomy supported by an enhanced on-board computational capability. We focus on introspection-based adaptive fault tolerance guided by the specific requirements of applications. Introspection supports monitoring of the program execution with the goal of identifying, locating, and analyzing errors. Fault tolerance assertions for the introspection system can be provided by the user, derived from domain-specific knowledge, or obtained from the results of static or dynamic program analysis. This work is part of an ongoing project at the Jet Propulsion Laboratory in Pasadena, California.
[...] systems control the new generation of fly-by-wire aircraft, such as the Airbus and Boeing airliners. Most space missions of the past were largely controlled from Earth, so that a significant number of failures could be handled by putting the spacecraft in a 'safe' mode, with Earth-bound controllers attempting to return it to operational mode. This approach will no longer work for future robotic deep-space missions, which will require enhanced autonomy and a powerful on-board computational capability. Such missions are becoming possible as a result of recent advances in microprocessor technology, which are leading to low-power many-core chips that today already have on the order of 100 cores. These developments imply a range of consequences for fault tolerance, some of them challenging and others providing new opportunities. In this paper, we focus on an approach for software-implemented application-adaptive fault tolerance, which is made possible by the enhanced multithreading capability of modern hardware. This paper is an extended and modified version of a paper presented at the Euro-Par 2010 conference [2].
The paper is structured as follows. In Section 2, we establish a conceptual basis, providing more precise definitions for the notions of dependability and fault tolerance. Section 3 gives an overview of future missions and their requirements. After outlining the global structure of our approach in Section 4, we take a closer look at the introspection framework and its structure (Section 5). Adaptive fault tolerance is discussed in Section 6. The paper ends with an overview of related work and concluding remarks in Sections 7 and 8.
FAULT TOLERANCE IN THE CONTEXT OF DEPENDABILITY
Methods for fault detection and recovery
Introspection-based fault tolerance provides a flexible approach that, in addition to applying innovative methods, can leverage existing technology. Methods that are useful in this context include assertion-based acceptance tests that check the value of an assertion and transfer control to the IFT