Real-time systems are being increasingly used in several applications which are time critical in nature. Fault-tolerance is an important requirement of such systems, due to the catastrophic consequences of not tolerating faults. In this paper, we study a scheme that provides fault-tolerance through scheduling in real-time multiprocessor systems. We schedule multiple copies of dynamic, aperiodic, nonpreemptive tasks in the system, and use two techniques that we call deallocation and overloading to achieve high acceptance ratio (percentage of arriving tasks scheduled by the system). This paper compares the performance of our scheme with that of other fault-tolerant scheduling schemes, and determines how much each of deallocation and overloading affects the acceptance ratio of tasks. The paper also provides a technique that can help real-time system designers determine the number of processors required to provide fault-tolerance in dynamic systems. Lastly, a formal model is developed for the analysis of systems with uniform tasks.
Abstract-Real-time systems (RTS) are those whose correctness depends on satisfying the required functional as well as the required temporal properties. Due to the criticality of such systems, recovery from faults is an essential part of a RTS. In many systems, such as those supporting space applications, single event upsets (SEUs) are the prevalent type of faults; SEUs are transient faults and affect a single task at a time. This paper presents a scheme to guarantee that the execution of real-time tasks can tolerate SEUs and intermittent faults assuming any queue-based scheduling technique. Three algorithms are presented to solve the problem of adding fault tolerance to a queue of real-time tasks by reserving sufficient slack in a schedule so that recovery can be carried out before the task deadline without compromising guarantees given to other tasks. The first algorithm is a dynamic programming optimal solution, the second is a linear-time heuristic for scheduling dynamic tasks, and the third algorithm comprises extensions to address queues with gaps between tasks (gaps are caused by precedence, resource, or timing constraints). We show through simulations that the heuristics closely approximate the optimal algorithm. Finally, the paper describes the implementation of the modified admission control algorithm, the nonpreemptive scheduler, and a recovery mechanism in the FT-RT-Mach operating system.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.