This paper surveys the various problems involved in achieving very high reliability from complex computing systems, and discusses the relationship between system structuring techniques and techniques of fault tolerance. Topics covered include: 1) protective redundancy in hardware and software; 2) the use of atomic actions to structure the activity of a system to limit information flow; 3) error detection techniques; 4) strategies for locating and dealing with faults and for assessing the damage they have caused; and 5) forward and backward error recovery techniques, based on the concepts of recovery line, commitment, exception, and compensation. The ideas described relate to techniques used to date in systems intended for environments in which high reliability is demanded. Three specific systems, the JPL-STAR, the Bell Laboratories ESS No. 1A processor, and the PLURIBUS, are described in some detail and compared.