Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing 2017
DOI: 10.1145/3078597.3078617

Towards a More Complete Understanding of SDC Propagation

Abstract: With the rate of errors that can silently affect an application's state/output expected to increase on future HPC machines, numerous application-level detection and recovery schemes have been proposed. Recovery is more efficient when errors are contained and affect only part of the computation's state. Containment is usually achieved by verifying all information leaking out of a statically defined containment domain, which is an expensive procedure. Alternatively, error propagation can be analyzed to bound the…

Citations: cited by 24 publications (16 citation statements)
References: 48 publications
“…Faults are further distinguished into detected and corrected, detected and uncorrectable, and undetected ones. The latter category can take the form of silent data corruption (SDC) and lead the program to compute the wrong solution unbeknown to the user (Calhoun et al, 2017; De Oliveira et al, 2017; Elliott et al, 2014, 2015, 2016; Feng et al, 2010; Fiala et al, 2012; Guhur et al, 2017; Li et al, 2018; Michalak et al, 2012). Next to other performance figures in HPC, mean time between failure (MTBF), made of the sum of mean time to interrupt (MTTI) and mean time to repair (MTTR), has arisen as a measure of reliability of a computing system.…”
Section: Taxonomy (mentioning)
confidence: 99%
“…Deleting a_i decreases the detection overhead. The total execution times of the instructions mapped by a_i is considered as the gain of deleting a_i, and is represented by Equation (8). Further, the profit of deleting a_i can be expressed by Equation (9).…”
Section: Screening Assertions For Neighbouring Program Points (mentioning)
confidence: 99%
“…However, they require substantial development efforts, and the hardware modules are barely portable. Moreover, as the soft error rates increase, hardware may not provide adequate protection [8]. In contrast, software-based approaches require no hardware and provide high portability and a short development time, and are therefore promising.…”
Section: Introduction (mentioning)
confidence: 99%
“…Recently, techniques that allow visualization of corrupted application data across loop iterations and MPI processes have been developed. For example, Calhoun et al [17] replicate instructions to track and visualize how errors propagate within the application. However, their approach can be expensive when analyzing complex applications.…”
Section: Related Work (mentioning)
confidence: 99%
“…To capture and extract these patterns, however, a new method is required. While some methods exist to inject faults and statistically quantify their manifestation, such as random fault injection [2], [9], [10], [11], [12], and to use program analysis [13], [14], [15], [16], [17] to track errors on individual instructions, these methods miss the fine-grained information on error propagation as well as the context needed to explain, at a fine granularity, how errors propagate and consequently how natural resilient computations occur. In other words, these approaches do not provide the needed reasoning about how multiple computations work together to make an error disappear or to diminish its impact.…”
Section: Introduction (mentioning)
confidence: 99%