Towards a More Complete Understanding of SDC Propagation

Calhoun, Jon C.; Snir, Marc; Olson, Luke N.; Gropp, William

doi:10.1145/3078597.3078617

Cited by 24 publications

(16 citation statements)

References 48 publications

“…Faults are further distinguished into detected and corrected, detected and uncorrectable, and undetected ones. The latter category can take the form of silent data corruption (SDC) and lead the program to compute the wrong solution unbeknown to the user (Calhoun et al, 2017; De Oliveira et al, 2017; Elliott et al, 2014, 2015, 2016; Feng et al, 2010; Fiala et al, 2012; Guhur et al, 2017; Li et al, 2018; Michalak et al, 2012). Next to other performance figures in HPC, mean time between failure (MTBF), made of the sum of mean time to interrupt (MTTI) and mean time to repair (MTTR), has arisen as a measure of reliability of a computing system.…”

Section: Taxonomy (mentioning)

confidence: 99%

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

Benacchio, Bonaventura, Altenbernd, et al. 2021

The International Journal of High Performance Computing Applica

Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.

“…Deleting a i decreases the detection overhead. The total execution times of the instructions mapped by a i is considered as the gain of deleting a i , and is represented by Equation (8). Further, the profit of deleting a i can be expressed by Equation (9).…”

Section: Screening Assertions For Neighbouring Program Points (mentioning)

confidence: 99%

“…However, they require substantial development efforts, and the hardware modules are barely portable. Moreover, as the soft error rates increase, hardware may not provide adequate protection [8]. In contrast, software-based approaches require no hardware and provide high portability and a short development time, and are therefore promising.…”

Section: Introduction (mentioning)

confidence: 99%

F_Radish: Enhancing Silent Data Corruption Detection for Aerospace-Based Computing

Yang, Wang 2020

Electronics

Radiation-induced soft errors degrade the reliability of aerospace-based computing. Silent data corruption (SDC) is the most dangerous and insidious type of soft error result. To detect SDC, program invariant assertions are used to harden programs. However, there exist redundant assertions in hardened programs, which impairs the detection efficiency. Benign errors are another type of soft error result. An assertion may detect benign errors, incurring unnecessary recovery overhead. The detection degree of an assertion represents the detection capability, and an assertion with a high detection degree can detect severe errors. To improve the detection efficiency and detection degree while reducing the benign detection ratio, F_Radish is proposed in the present work to screen redundant assertions in a novel way. At a program point, the detection degree and benign detection ratio are considered to evaluate the importance of the assertions in the program point. As a result, only the most important assertion remains in the program point. Moreover, the redundancy degree is considered to screen redundant assertions for neighbouring program points. Experimental results show that in comparison with the Radish approach, the detection efficiency of F_Radish is about two times greater. Moreover, F_Radish reduces the benign detection ratio and improves the detection degree. It can avoid more unnecessary recovery overheads and detect more serious SDC than can Radish.

“…Recently, techniques that allow visualization of corrupted application data across loop iterations and MPI processes have been developed. For example, Calhoun et al [17] replicate instructions to track and visualize how errors propagate within the application. However, their approach can be expensive when analyzing complex applications.…”

Section: Related Work (mentioning)

confidence: 99%

“…To capture and extract these patterns, however, a new method is required. While some methods exist to inject faults and statistically quantify their manifestation, such as random fault injection [2], [9], [10], [11], [12], and to use program analysis [13], [14], [15], [16], [17] to track errors on individual instructions, these methods miss the fine-grained information on error propagation as well as the context needed to explain, at a fine granularity, how errors propagate and consequently how natural resilient computations occur. In other words, these approaches do not provide the needed reasoning about how multiple computations work together to make an error disappear or to diminish its impact.…”

Section: Introduction (mentioning)

confidence: 99%

FlipTracker: Understanding Natural Error Resilience in HPC Applications

Guo, Liu, Laguna, et al. 2018

SC18: International Conference for High Performance Computing, Networking, Storage and Analysis

As high-performance computing systems scale in size and computational power, the danger of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact application state, grows dramatically. Consequently, applications running on HPC systems need to exhibit resilience to such errors. Previous work has found that, for certain codes, this resilience can come for free, i.e., some applications are naturally resilient, but few studies have shown the code patterns (combinations or sequences of computations) that make an application naturally resilient. In this paper, we present FlipTracker, a framework designed to extract these patterns using fine-grained tracking of error propagation and resilience properties, and we use it to present a set of computation patterns that are responsible for making representative HPC applications naturally resilient to errors. This not only enables a deeper understanding of resilience properties of these codes, but also can guide future application designs towards patterns with natural resilience.
