2003
DOI: 10.1109/tc.2003.1197125

Tests and tolerances for high-performance software-implemented fault detection

Abstract: We describe and test a software approach to fault detection in common numerical algorithms. Such result checking or algorithm-based fault tolerance (ABFT) methods may be used, for example, to overcome single-event upsets in computational hardware or to detect errors in complex, high-efficiency implementations of the algorithms. Following earlier work, we use checksum methods to validate results returned by a numerical subroutine operating subject to unpredictable errors in data. We consider common mat…
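As a rough illustration of the checksum idea mentioned in the abstract, the sketch below checks a matrix product C = AB by comparing Cw against A(Bw) for a weight vector w. This is a minimal sketch under stated assumptions, not the paper's method: the function names, the all-ones weight vector, and the relative tolerance `tol` are hypothetical choices made for this example, and the paper's own tolerance derivation is not reproduced here.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Plain matrix product; stands in for the routine being checked."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def checksum_test(A, B, C, tol=1e-9):
    """Return True if C is consistent with A*B under a checksum test.

    Compares C*w with A*(B*w) for w = all ones (an assumed weight vector).
    The threshold tol * scale is an illustrative choice, not a derived bound.
    """
    w = [1.0] * len(B[0])
    lhs = matvec(C, w)              # checksum of the returned result
    rhs = matvec(A, matvec(B, w))   # checksum computed from the inputs
    scale = max(1.0, max(abs(x) for x in rhs))
    return max(abs(l - r) for l, r in zip(lhs, rhs)) <= tol * scale


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = matmul(A, B)
ok_before = checksum_test(A, B, C)

C[0][1] += 1.0                      # simulate a fault in one result entry
ok_after = checksum_test(A, B, C)
```

The check costs three matrix-vector products rather than a second matrix product. The tolerance is needed because floating-point round-off makes the two checksums differ slightly even when no fault occurs.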

Cited by 30 publications (14 citation statements); references 32 publications.
“…ABFT [9,27,32] techniques are tailored solutions to specific numerical algorithms. As a result, they are usually efficient.…”
Section: Related Work (mentioning)
confidence: 99%
“…The effects of multiple faults, including those that occur during the postcondition test itself, have been explored through experiment [17]. Previous work has also explored the setting of error bounds for checksum tests [5], [11], [18].…”
Section: Radiation Detection (mentioning)
confidence: 99%
“…A common hardware technique for achieving radiation protection for SRAM is Triple-Modular Redundancy (TMR), in which three identical components perform the same memory operations and then vote on the result [2]. Software-based strategies include error detection and correction (EDAC) codes, which employ a "memory scrubber" process to run continually in the background to correct errors [3], and algorithm-specific tests to detect when an error has occurred (e.g., [4], [5]). Most of the latter has focused on general purpose computing.…”
Section: Introduction and Objectives (mentioning)
confidence: 99%
“…Like result checking techniques [18,20], postconditions depend upon the function being computed regardless of the underlying implementation algorithm.…”
Section: Assertion Extensions (mentioning)
confidence: 99%
“…Due to the very precise, compute-intensive nature of science and engineering applications, they are more susceptible to overflow, underflow, and round-off errors than most IT applications [12,18,20]. The aggregation of round-off errors over the life of an iterative computation that can take days, weeks, or months to run can result in a tremendous waste of time and compute resources.…”
Section: Adaptation Strategies (mentioning)
confidence: 99%