Empirical Studies of the Soft Error Susceptibility of Sorting Algorithms to Statistical Fault Injection

Guan, Qiang; DeBardeleben, Nathan; Blanchard, Sean; Fu, Song

doi:10.1145/2751504.2751512

Cited by 9 publications

(3 citation statements)

References 17 publications


“…Fang et al [10] discussed a systems approach to SDCs. Other papers describe SDC-resilience for sorting algorithms [11] and matrix factorization [27], and radiation-induced SDCs in GPUs [25]. We did not find prior work related to mercurial cores in HPC.…”

Section: Related Work (mentioning)

confidence: 76%

“…Perhaps compilers could detect blocks of code whose correct execution is especially critical (via programmer annotations or impact analysis), and then automatically replicate just these computations. More generally, can we extend the class of SDC-resilient algorithms beyond sorting and matrix factorization [11,27]? That prior work evaluated algorithms using fault injection, a technique that does not require access to a large fleet.…”

Section: Next Steps and Research Directions (mentioning)

confidence: 99%


Cores that don't count

Hochschild

Turner

Mogul

et al. 2021

Proceedings of the Workshop on Hot Topics in Operating Systems

109


We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often "silent": the only symptom is an erroneous computation.

We refer to a core that develops such behavior as "mercurial." Mercurial cores are extremely rare, but in a large fleet of servers we can observe the disruption they cause, often enough to see them as a distinct problem, one that will require collaboration between hardware designers, processor vendors, and systems software architects.

This paper is a call-to-action for a new focus in systems research; we speculate about several software-based approaches to mercurial cores, ranging from better detection and isolating mechanisms, to methods for tolerating the silent data corruption they cause.


“…From another point of view, as HPC power is targeting applications beyond the graphics domain, such as scientific applications and stock markets, it faces the challenge of addressing the need to generate accurate results that should be free of errors, as these applications cannot tolerate the existence of errors as graphical applications [7]. Hard errors are not the only concern of the HPC community, soft errors are a concern as well [8]. In [9] a study done on the data of two large-scale sites of a set of systems showed that hardware and software errors covering a considerable large proportion of root causes of failures.…”

Section: Introduction (mentioning)

confidence: 99%

A Two-Level Fault-Tolerance Technique for High Performance Computing Applications

Aseeri

Fadel

2018

IJACSA


Reliability is the biggest concern facing future extreme-scale, high performance computing (HPC) systems. Projections based on the current generation of HPC systems suggest that errors will occur at very high rates in future systems. Thus, it is fundamental that we detect errors that can cause the failure of important applications, such as scientific ones. In this paper, we present a two-level fault-tolerance approach for the detection and classification of errors for Compute Unified Device Architecture (CUDA)-based Graphics Processing Units (GPUs). In the first level, it detects the existence of errors by using software redundancy that applies design diversity. In the second level, it investigates the problematic software version and re-executes it on a different hardware component to classify whether the error is a permanent hardware error or a software error. We implemented our approach to run on GPUs and conducted proof-of-concept experiments by running three versions of matrix multiplication with different error scenarios; the results show the feasibility of the proposed approach.
