We present a new approach to algorithm-based fault tolerance (ABFT) and parity-checking techniques in the design of high performance computing systems. The ABFT technique employs real convolution error-correcting codes to encode the input data. In order to reduce the round-off error from the output decoding process, systematic real convolution encoding is employed. This paper proposes an efficient method to detect the arithmetic errors using convolution codes at the output compared with an equivalent parity value derived from the input data. Number data processing errors are detected by comparing parity values associated with a convolution code. These comparable sets will be very close numerically, although not identical because of round-off error differences between the two parity generation processes. The effects of internal failures and round-off error are modeled by additive error sources located at the output of the processing block and input at threshold detector. This model combines the aggregate effects of errors and applies them to the respective outputs.
In this paper, the authors present a new approach to algorithm based fault tolerance (ABFT) for High Performance computing system. The Algorithm Based Fault Tolerance approach transforms a system that does not tolerate a specific type of fault, called the fault-intolerant system, to a system that provides a specific level of fault tolerance, namely recovery. The ABFT techniques that detect errors rely on the comparison of parity values computed in two ways, the parallel processing of input parity values produce output parity values comparable with parity values regenerated from the original processed outputs, can apply convolution codes for the redundancy. This method is a new approach to concurrent error correction in fault-tolerant computing systems. This paper proposes a novel computing paradigm to provide fault tolerance for numerical algorithms. The authors also present, implement, and evaluate early detection in ABFT.
We present a framework for algorithm-based fault tolerance (ABFT) methods in the design of fault tolerant computing systems. The ABFT error detection technique relies on the comparison of parity values computed in two ways. The parallel processing of input parity values produce output parity values comparable with parity values regenerated from the original processed outputs. Number data processing errors are detected by comparing parity values associated with a convolution code. This article proposes a new computing paradigm to provide fault tolerance for numerical algorithms. The data processing system is protected through parity values defined by a high-rate real convolution code. Parity comparisons provide error detection, while output data correction is affected by a decoding method that includes both round-off error and computer-induced errors. To use ABFT methods efficiently, a systematic form is desirable. A class of burst-correcting convolution codes will be investigated. The purpose is to describe new protection techniques that are easily combined with data processing methods, leading to more effective fault tolerance.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.