This paper presents a new signature monitoring technique, CFCSS (Control Flow Checking by Software Signatures); CFCSS is a pure software method that checks the control flow of a program using assigned signatures. An algorithm assigns a unique signature to each node in the program graph and adds instructions for error detection. Signatures are embedded in the program during compilation time using the constant field of the instructions and compared with run-time signatures when the program is executed. Another algorithm reduces the code size and execution time overhead caused by checking instructions in CFCSS. A "branching fault injection experiment" was performed with benchmark programs. Without CFCSS, an average of 33.7% of the injected branching faults produced undetected incorrect outputs; however, with CFCSS, only 3.1% of branching faults produced undetected incorrect outputs. Thus it is possible to increase error detection coverage for control flow errors by an order of magnitude using CFCSS. The distinctive advantage of CFCSS over previous signature monitoring techniques is that CFCSS is a pure software method, i.e., it needs no dedicated hardware such as a watchdog processor for control flow checking. A watchdog task in multitasking environment also needs no extra hardware, but the advantage of CFCSS over a watchdog task is that CFCSS can be used even when the operating system does not support multitasking.
This paper proposes a pure software technique, Error Detection by Duplicated Instructions (EDDI), for detecting errors during normal system operation. Compared to other error detection techniques that use hardware redundancy, our method does not require any hardware modifications to add error detection capability to the original system. In EDDI, we duplicate instructions during compilation and use different registers and variables for the new instructions. Especially for the fault in the code segment of memory, we have derived formulas to estimate the error detection coverage of EDDI using probabilistic methods. These formulas use statistics of the program, which are collected during compilation. We applied our technique to eight benchmark programs and estimated the error detection coverage. Then, we verified the estimates by simulations, in which a fault injector forced a bit flip in the code segment of executable machine codes. The simulation results validated the estimated fault coverage and show that approximately 1.5% of injected faults produced incorrect results in eight benchmark programs with EDDI, while on average, 20% of injected faults produced undetected incorrect results in the programs without EDDI. Based on the theoretical estimates and actual fault injection experiments, we show that EDDI can provide over 98% fault coverage without any extra hardware for error detection. This pure software technique is especially useful when designers cannot change the hardware system but they need dependability in the computer system. The Control Flow Checking by Software Signatures (CFCSS) technique can be used with EDDI to increase the fault coverage. In order to reduce the performance overhead, our technique schedules the instructions that are added for detecting errors such that Instruction-Level Parallelism (ILP) is maximized. We have showed that the execution time overhead in a 4-way super-scalar processor is less than the execution time overhead in the processors that can issue 2 instructions in one cycle.
& CONCLUSIONSIn many computer systems, the contents of memory are protected by an error detection and correction (EDAC) code. Bit-flips caused by single event upsets (SEUs) are a well-known problem in memory chips and EDAC codes have been an effective solution to this problem. These codes are usually implemented in hardware using extra memory bits and encoding/decoding circuitry. In systems where EDAC hardware is not available, the reliability of the system can be improved by providing protection through software. Codes and techniques that can be used for software implementation of EDAC are discussed and compared. objective of our experiment is to see whether software-implemented hardware faulttolerance -which can include software-implemented EDAC -can provide sufficient reliability for COTS hardware to make it usable in low-radiation space applications.Power fluctuation and electromagnetic interference may cause bit-flips in memories. It has been observed that radiation-induced transient errors also occur at ground level [8]. Therefore, the technique presented in this paper can be useful for terrestrial applications, too.Previous discussions of software-implemented EDAC concentrate on communications and secondary storage systems [9][10][11][12][13][14]. In Sec. 2, we review some of these previous studies. In Sec. 3, we look at the problem in more detail and discuss the requirements of a scheme for the particular application presented above. Four different example EDAC coding schemes were implemented in software. These schemes are compared in Sec. 4. Issues that have to be considered for handling multiple errors and solutions to them are discussed in Sec. 5. Finally, the EDAC program has to be integrated into the whole system. We present our implementation in ARGOS in Sec. 6.The reliability improvement of an application in a space environment is estimated in Sec. 7. We conclude the paper with a discussion in Sec. 8.
This paper presents a new fault-tolerance technique for cache memories. Current fault-tolerance techniques for caches are limited either by the number of faults that can be tolerated or by the rapid degradation of performance as the number of faults increases. In this paper, we present a new technique that overcomes these two problems.This technique uses a special Programmable Address Decoder (PAD) to disable faulty blocks and to re-map their references to healthy blocks. Simulation results show that the performance degradation of direct-mapped caches with PAD is smaller than the previous techniques. However, for set-associative caches, the overhead of PAD is primarily advantageous if a relatively large number of faults is to be tolerated. The area overhead was estimated at about 10% of the overall cache area for a hypothetical design and is expected to be less for actual designs. The access time overhead is negligible.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.