2018 IEEE International Reliability Physics Symposium (IRPS) 2018
DOI: 10.1109/irps.2018.8353563

Exascale fault tolerance challenge and approaches

Cited by 4 publications (4 citation statements)
References 10 publications
“…The dark blue bars are the estimated outcome rates (i.e., p_2, p_3, and p_4) obtained using our MBU estimation model (6). Finally, the thin light blue bars represent the estimated results obtained using the naïve model (3). Since the estimations are based on the single-bit fault injection results (p_1), there are no estimated values for the SBU cases (i.e., there is no p_1).…”
Section: B Evaluation Results (mentioning)
confidence: 99%
“…Those transient bit-flip faults can result in catastrophic consequences, such as system crash or even undetected data corruptions. The probability of soft errors at system-level increases due to the increased number of devices (i.e., the number of memory cells or flip-flops) in a system [3]. Moreover, as technology scales, the probability of having multiple affected nodes per event (single-event multiple upsets, or SEMUs) also emerges [4]-[6].…”
Section: Introduction (mentioning)
confidence: 99%
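
The scaling argument in the statement above can be illustrated with a small sketch. It assumes, purely for illustration, that every device (memory cell or flip-flop) is upset independently with the same probability p over some time window, so the chance of at least one upset among N devices is 1 - (1 - p)^N. The function name and the independence assumption are ours, not taken from the cited papers.

```python
import math


def system_upset_probability(p_device: float, n_devices: int) -> float:
    """Probability of at least one upset among n_devices devices.

    Assumption (illustrative only): devices fail independently, each with
    probability p_device, so P = 1 - (1 - p_device) ** n_devices.
    """
    if not 0.0 <= p_device < 1.0:
        raise ValueError("p_device must be in [0, 1)")
    if n_devices < 0:
        raise ValueError("n_devices must be non-negative")
    # log1p/expm1 keep precision when p_device is tiny and n_devices is huge.
    return -math.expm1(n_devices * math.log1p(-p_device))


if __name__ == "__main__":
    # Hypothetical per-device probability; not a measured value.
    for n in (10**6, 10**9, 10**12):
        print(n, system_upset_probability(1e-12, n))
```

Under this model the system-level probability grows monotonically with the device count, which is the qualitative point the statement makes.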
“…It is expected that the next generation of HPC systems experiences failures every few hours [10], [28]. Consequently, most long-running HPC applications will experience multiple failures during the execution due to the reduced mean time between failures (MTBF) [18], [24]. Usually, HPC applications employ checkpoint and restart (CR) to recover from failures [11].…”
Section: Introduction (mentioning)
confidence: 99%
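
As a rough sketch of why a reduced MTBF implies multiple failures per run: if node failures are assumed independent with a constant rate, the system MTBF is approximately the per-node MTBF divided by the node count, and the expected number of failures during a run is the runtime divided by the system MTBF. The function names and the numbers in the usage lines are hypothetical and chosen only for illustration.

```python
def system_mtbf_hours(node_mtbf_hours: float, n_nodes: int) -> float:
    """Approximate system MTBF, assuming independent nodes with constant failure rates."""
    if node_mtbf_hours <= 0 or n_nodes <= 0:
        raise ValueError("node_mtbf_hours and n_nodes must be positive")
    return node_mtbf_hours / n_nodes


def expected_failures(runtime_hours: float, mtbf_hours: float) -> float:
    """Expected number of failures during a run of the given length."""
    if runtime_hours < 0 or mtbf_hours <= 0:
        raise ValueError("runtime_hours must be >= 0 and mtbf_hours > 0")
    return runtime_hours / mtbf_hours


if __name__ == "__main__":
    # Hypothetical values: 50,000-hour node MTBF, 10,000 nodes, 48-hour job.
    mtbf = system_mtbf_hours(50_000, 10_000)
    print(mtbf, expected_failures(48, mtbf))
```

With these illustrative inputs the system fails every few hours and a two-day job sees several failures, which is the situation checkpoint and restart is meant to handle.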
“…Resiliency, in addition to concurrency and energy efficiency, is and will be a major challenge for future high-performance computing (HPC) architectures. Preliminary estimates suggest mean time to interrupt (MTTI) could be from a few hours to a day as the concurrency on future systems increases rapidly. Without additional effort, applications tend to be susceptible to this reduction in MTTI, resulting in potentially substantial losses in productivity for end users.…”
Section: Introduction (mentioning)
confidence: 99%