Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2017
DOI: 10.1145/3126908.3126960

Experimental and analytical study of Xeon Phi reliability

Abstract: We present an in-depth analysis of transient fault effects on HPC applications in Intel Xeon Phi processors based on radiation experiments and high-level fault injection. Besides measuring the realistic error rates of Xeon Phi, we quantify Silent Data Corruptions (SDCs) by correlating the distribution of corrupted elements in the output to the application's characteristics. We evaluate the benefits of imprecise computing for reducing the programs' error rate. For example, for HotSpot a 0.5% tolerance in the ou…
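The idea of tolerating small output deviations can be illustrated with a minimal sketch. It is not the authors' tooling: the function name `count_mismatches`, the sample values, and the choice of a per-element relative error are all assumptions made for illustration. The sketch compares an observed output against a fault-free ("golden") output and counts the elements that differ by more than a given relative tolerance.

```python
def count_mismatches(golden, observed, rel_tol=0.0):
    """Count output elements whose relative error exceeds rel_tol.

    Hypothetical helper: with rel_tol=0.0 any difference counts as a
    corrupted element; a positive rel_tol accepts small deviations.
    """
    mismatches = 0
    for g, o in zip(golden, observed):
        if g == o:
            continue
        # Fall back to an absolute comparison when the golden value is zero.
        denom = abs(g) if g != 0 else 1.0
        if abs(o - g) / denom > rel_tol:
            mismatches += 1
    return mismatches


# Made-up sample data: one small deviation (0.2%) and one large one (10%).
golden = [100.0, 200.0, 300.0]
observed = [100.0, 200.4, 330.0]

strict = count_mismatches(golden, observed)                  # exact comparison
relaxed = count_mismatches(golden, observed, rel_tol=0.005)  # 0.5% tolerance
print(strict, relaxed)
```

With these sample values, the exact comparison flags two elements, while the 0.5% tolerance flags only the element that is off by 10%.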

Cited by 50 publications (16 citation statements)
References 41 publications
“…Even though the soft error rate is increasing exponentially, a soft error is still a rare event. For example, even though we have tested high-energy neutrons, it still needs more than 500 h to represent 57,000 years in normal execution [35]. Further, even though we have injected radiation-induced faults into hardware devices, it is hard to analyze masking effects since we cannot determine types of soft errors, such as the number of bits and locality of errors.…”
Section: Discussion
confidence: 99%
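The figures quoted in this statement imply a beam acceleration factor that can be checked with a short calculation. The sketch below uses only the two numbers from the quote (500 beam hours, 57,000 equivalent years); the function name `acceleration_factor` and the 365.25-day year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumed average year length, 8766 hours


def acceleration_factor(equivalent_years, beam_hours):
    """Ratio of natural-exposure time to beam time (hypothetical helper)."""
    return equivalent_years * HOURS_PER_YEAR / beam_hours


factor = acceleration_factor(57_000, 500)
print(f"{factor:.3g}")
```

Under these assumptions the factor is roughly one million. Since the quote says "more than 500 h", the true ratio would be somewhat lower.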
“…Given our focus on the protection of structured address generation, we focus on error impacting instructions that contribute to address generation. Precise analysis of the propagation of soft errors from microarchitectural state to the application can require expensive particle-beam experiments (e.g., [7,27]) or time-consuming microarchitectural simulations. One study [6] points to the inadequacies in the space of fault injection at higher levels of abstraction.…”
Section: Error Model
confidence: 99%
“…Considering that the common use of ECC (Error-Correcting Code) can mask majority of transient errors in the memory of HPC machines, IterPro mainly focuses on those manifesting from CPU data paths that are difficult or impractical to protect using ECC-like techniques and are attracting increasing concern in the HPC community. For example, Oliveira et al. [17] project that a hypothetical exascale machine built with 190,000 cutting-edge Xeon Phi processors would experience daily transient errors with their memory areas protected with the ECC.…”
Section: Introduction
confidence: 99%