2015
DOI: 10.1007/s11227-015-1422-z

A framework for evaluating comprehensive fault resilience mechanisms in numerical programs

Abstract: As HPC systems approach Exascale, their circuit features will shrink while their overall size will grow, both at a fixed power limit. These trends imply that soft faults in electronic circuits will become an increasingly significant problem for programs that run on these systems, causing them to occasionally crash or worse, silently return incorrect results. This is motivating extensive work on program resilience to such faults, ranging from generic mechanisms such as replication or checkpoint/restart to algor…
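The abstract names replication as one of the generic resilience mechanisms the paper evaluates. As a concrete illustration only, here is a minimal Python sketch of replication with majority voting around a toy numerical kernel; the kernel, tolerance, and voting rule are assumptions for illustration, not the paper's framework.

```python
"""Minimal sketch of replication-based soft-fault detection: run a
numerical kernel several times and majority-vote the results. The kernel
and the agreement tolerance are illustrative assumptions."""

def kernel(x):
    # stand-in numerical computation; a real HPC kernel would go here
    return sum(v * v for v in x)

def replicated(x, runs=3, tol=1e-12):
    results = [kernel(x) for _ in range(runs)]
    # accept any value that a strict majority of replicas agree on
    for r in results:
        agree = sum(1 for s in results if abs(s - r) <= tol)
        if agree >= runs // 2 + 1:
            return r
    # no majority: treat as a detected soft fault and let the caller recover
    raise RuntimeError("replicas disagree -- soft fault detected")

if __name__ == "__main__":
    print(replicated([1.0, 2.0, 3.0]))  # 14.0 when no replica is corrupted
```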

Cited by 6 publications (1 citation statement)
References 21 publications
“…A 30% overhead in sequential performance clearly sets us back a few generations in terms of Moore's law. The detailed study of resilience solutions and their overheads provided in [5] emphasizes some of these points. Another significant drawback of these detection schemes is that they have false-positive rates that are much higher than the rates at which faults themselves occur, potentially causing unnecessary recomputations.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%
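The citing statement's point about false positives can be made concrete with a small back-of-the-envelope model: if alarms are raised at rate fault_rate + fp_rate and each alarm triggers a recomputation, then whenever fp_rate far exceeds fault_rate nearly every recomputation is wasted. The Python sketch below is a hypothetical model with assumed numbers; only the 30% overhead comes from the quote above.

```python
"""Hypothetical cost model for a fault detector whose false-positive rate
exceeds the true fault rate. All rates, times, and the model itself are
illustrative assumptions, not numbers from the paper."""

def expected_cost(base_time, detect_overhead, fault_rate, fp_rate, redo_cost):
    # run time once detection is enabled (e.g. the quoted 30% overhead)
    protected = base_time * (1.0 + detect_overhead)
    # expected number of alarms: true faults plus false positives
    alarms = (fault_rate + fp_rate) * protected
    # fraction of alarms that are spurious
    wasted = fp_rate / (fault_rate + fp_rate)
    return protected + alarms * redo_cost, wasted

if __name__ == "__main__":
    total, wasted = expected_cost(
        base_time=3600.0,      # one hour of work, assumed
        detect_overhead=0.30,  # the 30% sequential overhead quoted above
        fault_rate=1e-6,       # true faults per second, assumed
        fp_rate=1e-4,          # false positives per second, assumed (100x faults)
        redo_cost=60.0,        # seconds lost per triggered recomputation, assumed
    )
    print(f"expected time {total:.0f}s; {wasted:.1%} of alarms are spurious")
```

With these assumed rates the model reports that about 99% of alarms are spurious, which is the quoted drawback: the detector's false positives, not the faults themselves, dominate the recovery work.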