Tests and tolerances for high-performance software-implemented fault detection

Abstract: As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems, combined with targeted power and energy budgets, introduces significant reliability challenges. Silent data corruptions (SDCs), or silent errors, are a major source of corrupted execution results in HPC applications because they go undetected. In this work, we explore a low-memory-overhead SDC detector that leverages epsilon-insensitive support vector machine regression to detect SDCs in HPC applications that can be characterized by an impact error bound. The key contributions are threefold. (1) Our design takes spatial features (i.e., neighbouring data values for each data point in a snapshot) as training data, so that little memory overhead (less than 1%) is introduced. (2) We provide an in-depth study of the detection ability and performance with different parameters, and we carefully optimize the detection range. (3) Experiments with eight real-world HPC applications show that our detector can achieve a detection sensitivity (i.e., recall) of up to 99% with a false positive rate below 1% in most cases. Our detector incurs low performance overhead, 5% on average, for all benchmarks studied in the paper. Compared with other state-of-the-art techniques, our detector exhibits the best tradeoff between detection ability and overhead.
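
The idea of predicting each data point from its spatial neighbours and flagging values that fall outside a detection range can be illustrated with a much simpler stand-in. The sketch below does not use support vector regression: it assumes a one-dimensional snapshot and replaces the trained regressor with the mean of the two adjacent values. The function name `flag_outliers` and the parameter `radius` are hypothetical and chosen only for this illustration.

```python
def flag_outliers(snapshot, radius):
    """Return interior indices whose value differs from a neighbour-based
    prediction by more than `radius`.

    Assumption: the predictor is the mean of the two adjacent values,
    a stand-in for the trained regressor described in the abstract.
    """
    flagged = []
    for i in range(1, len(snapshot) - 1):
        predicted = 0.5 * (snapshot[i - 1] + snapshot[i + 1])
        if abs(snapshot[i] - predicted) > radius:
            flagged.append(i)
    return flagged


# A smooth (linear) snapshot and a copy with one corrupted value.
clean = [0.1 * i for i in range(10)]
corrupted = list(clean)
corrupted[5] = 100.0

clean_flags = flag_outliers(clean, radius=0.05)
corrupted_flags = flag_outliers(corrupted, radius=0.05)
```

With this stand-in predictor, the corrupted point and its two neighbours are all flagged, because the neighbours' predictions depend on the corrupted value. A larger `radius` lowers the false positive rate at the cost of missing smaller corruptions, which is the tradeoff the abstract refers to when it mentions optimizing the detection range.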

“…ABFT [9,27,32] techniques are tailored solutions to specific numerical algorithms. As a result, they are usually efficient.…”

Section: Related Work (mentioning)

confidence: 99%

Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era

Subaşi

Bautista-Gomez

et al. 2016

2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)

“…The effects of multiple faults, including those that occur during the postcondition test itself, have been explored through experiment [17]. Previous work has also explored the setting of error bounds for checksum tests [5], [11], [18].…”

Section: Radiation Detection (mentioning)

confidence: 99%

“…A common hardware technique for achieving radiation protection for SRAM is Triple-Modular Redundancy (TMR), in which three identical components perform the same memory operations and then vote on the result [2]. Software-based strategies include error detection and correction (EDAC) codes, which employ a "memory scrubber" process to run continually in the background to correct errors [3], and algorithm-specific tests to detect when an error has occurred (e.g., [4], [5]). Most of the latter has focused on general purpose computing.…”

Section: Introduction and Objectives (mentioning)

confidence: 99%
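
The voting step of Triple-Modular Redundancy mentioned in the quote above can be sketched in software. This is a minimal illustration, not a description of any particular hardware design; the function names `tmr_vote` and `bitwise_vote` are hypothetical.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value held by at least two of the three redundant copies.

    Raises ValueError when all three copies disagree, since no majority exists.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three copies disagree")
    return value


def bitwise_vote(a, b, c):
    """Majority vote applied independently to each bit of three integers."""
    return (a & b) | (a & c) | (b & c)
```

The bitwise form always yields a result, even when the three words differ in different bit positions, whereas the whole-value form reports the case where no two copies agree.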

Simulating and Detecting Radiation-Induced Errors for Onboard Machine Learning

Granat

Wagstaff

Bornstein

et al. 2009

2009 Third IEEE International Conference on Space Mission Challenges for Information Technology

Abstract: Spacecraft processors and memory are subjected to high radiation doses and therefore employ radiation-hardened components. However, these components are orders of magnitude more expensive than typical desktop components, and they lag years behind in terms of speed and size. We have integrated algorithm-based fault tolerance (ABFT) methods into onboard data analysis algorithms to detect radiation-induced errors, which ultimately may permit the use of spacecraft memory that need not be fully hardened, reducing cost and increasing capability at the same time. We have also developed a lightweight software radiation simulator, BITFLIPS, that permits evaluation of error detection strategies in a controlled fashion, including the specification of the radiation rate and selective exposure of individual data structures. Using BITFLIPS, we evaluated our error detection methods when using a support vector machine to analyze data collected by the Mars Odyssey spacecraft. We observed good performance from both an existing ABFT method for matrix multiplication and a novel ABFT method for exponentiation. These techniques bring us a step closer to "radhard" machine learning algorithms.
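
To make the combination of fault injection and a toleranced checksum test concrete, here is a minimal sketch. It assumes a classic column-checksum test for matrix multiplication; the abstract does not say which ABFT scheme was used, and the code does not reproduce the BITFLIPS interface. All names (`flip_bit`, `matmul`, `column_checksum_ok`, `rel_tol`) are hypothetical.

```python
import struct


def flip_bit(x, bit):
    """Flip one bit (0 = least significant) of a float's IEEE 754 double encoding."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def column_checksum_ok(a, b, c, rel_tol=1e-9):
    """Check that each column sum of C matches (column sums of A) times B.

    `rel_tol` absorbs ordinary round-off so that it is not reported as a fault.
    """
    col_sum_a = [sum(row[k] for row in a) for k in range(len(b))]
    for j in range(len(b[0])):
        expected = sum(col_sum_a[k] * b[k][j] for k in range(len(b)))
        actual = sum(row[j] for row in c)
        scale = max(abs(expected), abs(actual), 1.0)
        if abs(expected - actual) > rel_tol * scale:
            return False
    return True


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = matmul(a, b)

# Inject a fault into the high exponent bit of one result entry.
c_high = [row[:] for row in c]
c_high[0][0] = flip_bit(c_high[0][0], 62)

# Inject a fault into the lowest mantissa bit of the same entry.
c_low = [row[:] for row in c]
c_low[0][0] = flip_bit(c_low[0][0], 0)
```

In this sketch the exponent-bit flip fails the checksum test, while the flip in the lowest mantissa bit stays within `rel_tol` and passes. That is the tradeoff in setting the tolerance: a tight bound catches more faults but risks flagging round-off, and a loose bound lets small corruptions through.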

“…Like result checking techniques [18,20], postconditions depend upon the function being computed regardless of the underlying implementation algorithm.…”

Section: Assertion Extensions (mentioning)

confidence: 99%

“…Due to the very precise, compute-intensive nature of science and engineering applications, they are more susceptible to overflow, underflow, and round-off errors than most IT applications [12,18,20]. The aggregation of round-off errors over the life of an iterative computation that can take days, weeks, or months to run can result in a tremendous waste of time and compute resources.…”

Section: Adaptation Strategies (mentioning)

confidence: 99%
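
The accumulation of round-off error described in the last quote can be seen even in a very short computation. The sketch below is an illustration only, not taken from the cited work: it compares a naive running sum with Python's `math.fsum`, which tracks the lost low-order bits.

```python
import math

values = [0.1] * 10

# Naive left-to-right accumulation: each addition rounds its result.
naive_total = 0.0
for v in values:
    naive_total += v

# math.fsum avoids the loss of precision from intermediate rounding.
accurate_total = math.fsum(values)

drift = abs(naive_total - accurate_total)
```

Here the naive total differs from 1.0 while `math.fsum` returns exactly 1.0. Over a long iterative computation such drift can grow, which is why a fault detector's tolerance has to separate expected round-off from genuine corruption.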