Exascale fault tolerance challenge and approaches

McNairy, C.

doi:10.1109/irps.2018.8353563

Cited by 4 publications

(4 citation statements)

References 10 publications


“…The dark blue bars are the estimated outcome rates (i.e., p_2, p_3, and p_4) obtained using our MBU estimation model (6). Finally, the thin light blue bars represent the estimated results obtained using the naïve model (3). Since the estimations are based on the single-bit fault injection results (p_1), there are no estimated values for the SBU cases (i.e., there is no p_1).…”

Section: B Evaluation Results (mentioning)

confidence: 99%

“…Those transient bit-flip faults can result in catastrophic consequences, such as system crash or even undetected data corruptions. The probability of soft errors at system-level increases due to the increased number of devices (i.e., the number of memory cells or flip-flops) in a system [3]. Moreover, as technology scales, the probability of having multiple affected nodes per event (single-event multiple upsets, or SEMUs) also emerges [4]-[6].…”

Section: Introduction (mentioning)

confidence: 99%


Modeling Application-Level Soft Error Effects for Single-Event Multi-Bit Upsets

Cho

Kwon

2019

IEEE Access


Transient errors induced by radiation cause bit-flips in flip-flops (flip-flop soft errors). Modeling the error resilience level of a target system for flip-flop soft errors is a crucial step to achieve a cost-effective error resilience solution. This step often requires a significant amount of time and effort for a large number of fault injection simulations. As technology scales, the required effort grows in a new dimension with the increased probability of multi-bit upsets (MBUs). In this work, we present a new estimation model that predicts the resulting error resilience levels for the flip-flop MBU cases. This estimation model only requires the measured soft error effects of the single-bit upset (SBU) cases. This model uses two strategies to address how multiple bit-flips that happen simultaneously in a system affect the outcome of application execution. We evaluate the accuracy level of the MBU estimation model using actual fault injection results on two different processor cores. The two main strategies in our estimation model improve the accuracy levels by more than 7×.
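
The introduction statement quoted above says that the system-level soft error probability grows with the number of devices. A minimal sketch of that scaling follows, assuming every device (memory cell or flip-flop) is upset independently with the same probability; the function name, the parameters, and the independence assumption are illustrative and are not taken from the cited papers.

```python
import math


def system_upset_probability(per_device_prob: float, num_devices: int) -> float:
    """Probability that at least one of `num_devices` devices is upset.

    Assumes (hypothetically) independent, identically distributed upsets:
        P_system = 1 - (1 - p)^N
    """
    if not 0.0 <= per_device_prob <= 1.0:
        raise ValueError("per_device_prob must lie in [0, 1]")
    if num_devices < 0:
        raise ValueError("num_devices must be non-negative")
    if per_device_prob == 1.0:
        # log1p(-1) is undefined, so handle the certain-upset case directly.
        return 1.0 if num_devices > 0 else 0.0
    # log1p/expm1 keep precision when p is tiny and N is huge.
    return -math.expm1(num_devices * math.log1p(-per_device_prob))


if __name__ == "__main__":
    # Same per-device probability, growing device count.
    for n in (10**6, 10**9, 10**12):
        print(n, system_upset_probability(1e-12, n))
```

Under this assumption the system-level probability rises monotonically with the device count, which is the trend the statement describes. Correlated events such as multi-bit upsets are not covered by this sketch.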


“…It is expected that the next generation of HPC systems experiences failures every few hours [10], [28]. Consequently, most long-running HPC applications will experience multiple failures during the execution due to the reduced mean time between failures (MTBF) [18], [24]. Usually, HPC applications employ checkpoint and restart (CR) to recover from failures [11].…”

Section: Introduction (mentioning)

confidence: 99%

Design and Study of Elastic Recovery in HPC Applications

Keller

Parasyris

Bautista-Gomez

2020

2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)


The efficient utilization of current supercomputing systems with deep storage hierarchies demands scientific applications that are capable of leveraging such heterogeneous hardware. Fault tolerance, and checkpointing in particular, is one of the most time-consuming aspects if not handled correctly. High checkpoint performance can be achieved using optimized multilevel checkpoint and restart libraries. Unfortunately, those libraries do not allow for restarts with a modified number of processes or scientific post-processing of the checkpointed data. This is because they typically use an N-N checkpointing scheme and opaque file formats. In this article, we present a novel mechanism to asynchronously store checkpoints into a self-descriptive file format and load the data upon recovery with a different number of processes. We provide an API that defines the process-local data as part of a globally shared dataset. Our measurements demonstrate a low overhead between 0.6% and 2.5% for a 2.25 TB checkpoint with 6K processes.
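
The citation statement above notes that with failures every few hours, a long-running job should expect several failures per execution. The arithmetic can be sketched as follows, assuming failures arrive at a constant rate (exponentially distributed inter-arrival times); the function names and the 48-hour and 4-hour figures are hypothetical values chosen for illustration, not numbers from the cited papers.

```python
import math


def expected_failures(runtime_hours: float, mtbf_hours: float) -> float:
    """Expected number of failures during a run, assuming a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return runtime_hours / mtbf_hours


def prob_no_failure(runtime_hours: float, mtbf_hours: float) -> float:
    """Probability that the run completes without any failure.

    Under the (assumed) exponential model this is exp(-T / MTBF).
    """
    return math.exp(-expected_failures(runtime_hours, mtbf_hours))


if __name__ == "__main__":
    runtime, mtbf = 48.0, 4.0  # hypothetical: 48 h job, 4 h MTBF
    print("expected failures:", expected_failures(runtime, mtbf))
    print("P(no failure):", prob_no_failure(runtime, mtbf))
```

With these illustrative inputs the job expects a dozen failures and almost never finishes uninterrupted, which is why the statement ties reduced MTBF to the use of checkpoint and restart.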


“…Resiliency, in addition to concurrency and energy efficiency, is and will be a major challenge for future high‐performance computing (HPC) architectures. Preliminary estimates suggest mean time to interrupt (MTTI) could be from a few hours to a day as the concurrency on future systems increases rapidly. Without additional effort, applications tend to be susceptible to this reduction in MTTI, resulting in potentially substantial losses in productivity for end users.…”

Section: Introduction (mentioning)

confidence: 99%

Application health monitoring for extreme‐scale resiliency using cooperative fault management

Agarwal

Naughton

Park

et al. 2019

Concurrency and Computation


Summary: Resiliency is and will be a critical factor in determining scientific productivity on current and exascale supercomputers, and beyond. Applications oblivious to and incapable of handling transient soft and hard errors could waste supercomputing resources or, worse, yield misleading scientific insights. We introduce a novel application‐driven silent error detection and recovery strategy based on application health monitoring. Our methodology uses application output that follows known patterns as an indicator of an application's health, together with the knowledge that a violation of these patterns could indicate a fault. Information from system monitors that report hardware and software health status is used to corroborate faults. Collectively, this information is used by a fault coordinator agent to take preventive and corrective measures by applying computational steering to an application between checkpoints. This cooperative fault management system uses the Fault Tolerance Backplane as a communication channel. The benefits of this framework are demonstrated with two real application case studies, molecular dynamics and quantum chemistry simulations, on scalable clusters with simulated memory and I/O corruptions. The developed approach is general and can be easily applied to other applications.
