Experimental and analytical study of Xeon Phi reliability

Oliveira, Daniel; Pilla, Laércio Lima; DeBardeleben, Nathan; Blanchard, Sean; Quinn, Heather; Koren, Israel; Navaux, Philippe O. A.; Rech, Paolo

doi:10.1145/3126908.3126960

Cited by 50 publications

(16 citation statements)

References 41 publications


“…Even though the soft error rate is increasing exponentially, a soft error is still a rare event. For example, even though we have tested high-energy neutrons, it still needs more than 500 h to represent 57,000 years in normal execution [35]. Further, even though we have injected radiation-induced faults into hardware devices, it is hard to analyze masking effects since we cannot determine types of soft errors, such as the number of bits and locality of errors.…”

Section: Discussion (mentioning)

confidence: 99%
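
The 500 h versus 57,000 years comparison in the statement above implies a beam acceleration factor of roughly a million. A minimal Python sketch of that arithmetic follows, treating 500 h as an exact figure (the statement says "more than 500 h") and assuming a 365.25-day year; the function name is hypothetical.

```python
# Back-of-envelope acceleration factor for an accelerated neutron-beam test.
# Assumptions (not stated in the source): 500 h is taken as exact and a year
# is 365.25 days long.
HOURS_PER_YEAR = 365.25 * 24


def acceleration_factor(equivalent_years: float, beam_hours: float) -> float:
    """Ratio of natural-exposure time to beam time, both expressed in hours."""
    return equivalent_years * HOURS_PER_YEAR / beam_hours


factor = acceleration_factor(57_000, 500)
print(f"One beam hour stands in for about {factor:,.0f} hours of normal execution")
```

Because the statement gives 500 h as a lower bound, the result is best read as an upper bound on the acceleration factor.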

Characterizing System-Level Masking Effects against Soft Errors

2021

Electronics


From early design phases to final release, the reliability of modern embedded systems against soft errors should be carefully considered. Several schemes have been proposed to protect embedded systems against soft errors, but they are neither always functional nor robust, even with expensive overhead in terms of hardware area, performance, and power consumption. Thus, system designers need to estimate reliability quantitatively to apply appropriate protection techniques for resource-constrained embedded systems. Vulnerability modeling based on lifetime analysis is one of the most efficient ways to quantify system reliability against soft errors. However, lifetime analysis can be inaccurate, mainly because it fails to comprehensively capture several system-level masking effects. This study analyzes and characterizes microarchitecture-level and software-level masking effects by developing an automated framework with exhaustive fault injections (i.e., soft errors) based on a cycle-accurate gem5 simulator. We injected faults into a register file because errors in the register file can easily be propagated to other components in a processor. We found that only 5% of injected faults can cause system failures on an average over benchmarks, mainly from the MiBench suite. Further analyses showed that 71% of soft errors are overwritten by write operations before being used, and the CPU does not use 20% of soft errors at all after fault injections. The remainder are also masked by several software-level masking effects, such as dynamically dead instructions, compare and logical instructions that do not change the result, and incorrect control flows that do not affect program outputs.
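
The percentages in this abstract can be tallied to see how small the software-masked remainder is. The sketch below assumes every figure is a share of the total injected faults and that the categories do not overlap; the abstract does not state either point explicitly, and the variable names are illustrative.

```python
# Tally of the fault-injection outcomes quoted in the abstract.
# Assumption: each percentage is a share of all injected faults and the
# categories are mutually exclusive.
failures_pct = 5      # faults that cause system failures
overwritten_pct = 71  # overwritten by a write before being used
unused_pct = 20       # never used by the CPU after injection

# Whatever is left is attributed to software-level masking effects.
software_masked_pct = 100 - (failures_pct + overwritten_pct + unused_pct)
total_masked_pct = overwritten_pct + unused_pct + software_masked_pct

print(f"software-level masking: {software_masked_pct}%")
print(f"all masked faults: {total_masked_pct}%")
```

Under these assumptions, software-level masking accounts for only a few percent of injected faults, while overwrites and unused register values account for the bulk of the masking.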


“…Given our focus on the protection of structured address generation, we focus on error impacting instructions that contribute to address generation. Precise analysis of the propagation of soft errors from microarchitectural state to the application can require expensive particle-beam experiments (e.g., [7,27]) or time-consuming microarchitectural simulations. One study [6] points to the inadequacies in the space of fault injection at higher levels of abstraction.…”

Section: Error Model (mentioning)

confidence: 99%

FailAmp

Briggs, Das, Baranowski, et al.

2019

ACM Trans. Archit. Code Optim.


We present FailAmp, a novel LLVM program transformation algorithm that makes programs employing structured index calculations more robust against soft errors. Without FailAmp, an offset error can go undetected; with FailAmp, all subsequent offsets are relativized, building on the faulty one. FailAmp can exploit ISAs such as ARM to further reduce overheads. We verify correctness properties of FailAMP using an SMT solver, and present a thorough evaluation using many high-performance computing benchmarks under a fault injection campaign. FailAmp provides full soft-error detection for address calculation while incurring an average overhead of around 5%.


“…Considering that the common use of ECC (Error-Correcting Code) can mask majority of transient errors in the memory of HPC machines, IterPro mainly focuses on those manifesting from CPU data paths that are difficult or impractical to protect using ECC-like techniques and are attracting increasing concern in the HPC community. For example, Oliveira et al [17] project that a hypothetical exascale machine built with 190,000 cutting-edge Xeon Phi processors would experience daily transient errors with their memory areas protected with the ECC.…”

Section: Introduction (mentioning)

confidence: 99%

Near-Zero Downtime Recovery From Transient-Error-Induced Crashes

Chen, Eisenhauer, Pande

2022

IEEE Trans. Parallel Distrib. Syst.


Due to the system scaling, transient errors caused by external noises, e.g., heat fluxes and particle strikes, have become a growing concern for the current and upcoming extreme-scale high-performance-computing (HPC) systems. Applications running on these systems are expected to experience transient errors more frequently than ever before, which will either lead them to generate incorrect outputs or cause them to crash. However, since such errors are still quite rare as compared to no-fault cases, desirable solutions call for low/no-overhead systems that do not compromise the performance under no-fault conditions and also allow very fast fault recovery to minimize downtime. In this paper, we present IterPro, a light-weight compiler-assisted resilience technique to quickly and accurately recover processes from transient-error-induced crashes. During the compilation of applications, IterPro constructs a set of recovery kernels for crash-prone instructions. These recovery kernels are executed to repair the corrupted process states on-the-fly upon occurrences of errors, enabling applications to continue their executions instead of being terminated. When constructing recovery kernels, IterPro exploits side effects introduced by induction variable based code optimization techniques based on loop unrolling and strength reduction to improve its recovery capability. To this end, two new code transformation passes are introduced to expose the side effects for resilience purposes. We evaluated IterPro with 4 scientific workloads as well as the NPB benchmarks suite. During their normal execution, IterPro incurs almost zero runtime overhead and a small, fixed 27MB memory overhead. Meanwhile, IterPro can recover on an average 83.55% of crash-causing errors within dozens of milliseconds with negligible downtime. With such an effective recovery mechanism, IterPro could tremendously mitigate the overheads and resource requirements of the resilience subsystem in future extreme-scale systems.
