2017 46th International Conference on Parallel Processing (ICPP) 2017
DOI: 10.1109/icpp.2017.67

Resilience for Stencil Computations with Latent Errors

Abstract: Projections and measurements of error rates in near-exascale and exascale systems suggest a dramatic growth, due to extreme scale (10^9 cores), concurrency, software complexity, and deep submicron transistor scaling. Such a growth makes resilience a critical concern, and may increase the incidence of errors that "escape", silently corrupting application state. Such errors can often be revealed by application software tests but with long latencies, and thus are known as latent errors. We explore how to efficien…

Cited by 8 publications (8 citation statements)
References 25 publications
“…This will allow to recover from the state of the domain at that point in case of another error. Recall that an error correction mechanism specifically dedicated to stencil computations has already been presented [18], wherein authors exploit stencil locality to reduce the re-execution cost after an error has been detected. Such a method could be used to further lower the cost of re-computation in case of errors.…”
Section: Offline Error Correction
Citation type: mentioning
confidence: 99%
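As a rough illustration of the stencil-locality idea mentioned in the statement above, the following sketch shows why re-execution can be confined to a small region. It is not the cited paper's algorithm. It assumes a 1D 3-point averaging stencil with fixed end cells, a clean checkpoint taken `k` sweeps before detection, and a known corrupted cell; the names `step`, `affected_range`, and `recompute_region` are hypothetical.

```python
RADIUS = 1  # assumed 3-point stencil: each cell reads its immediate neighbours


def step(u):
    """One sweep of a 3-point averaging stencil; the two end cells are held fixed."""
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new


def affected_range(i, k, n, radius=RADIUS):
    """Inclusive range of cells that a corruption at cell i can reach after k sweeps."""
    return max(0, i - k * radius), min(n - 1, i + k * radius)


def recompute_region(checkpoint, lo, hi, k, radius=RADIUS):
    """Recompute cells lo..hi after k sweeps, starting from a clean checkpoint.

    Only a window with a halo of k*radius cells around [lo, hi] is swept.
    Values near an interior window edge become stale by one cell per sweep,
    which the halo absorbs, so cells lo..hi come out correct.
    """
    n = len(checkpoint)
    wlo = max(0, lo - k * radius)
    whi = min(n - 1, hi + k * radius)
    window = checkpoint[wlo:whi + 1]
    for _ in range(k):
        window = step(window)
    return window[lo - wlo:hi - wlo + 1]


n, k, bad = 40, 4, 20
start = [float(i % 7) for i in range(n)]  # clean checkpoint (made-up data)

clean = start
for _ in range(k):
    clean = step(clean)

# Inject a corruption right after the checkpoint; it is noticed k sweeps later.
corrupt = list(start)
corrupt[bad] += 100.0
for _ in range(k):
    corrupt = step(corrupt)

lo, hi = affected_range(bad, k, n)
differing = [i for i in range(n) if clean[i] != corrupt[i]]

# Repair only the affected cells instead of re-running the whole domain.
repaired = list(corrupt)
repaired[lo:hi + 1] = recompute_region(start, lo, hi, k)
```

In this toy setup the corrupted cells all fall inside `affected_range`, and the repair sweeps a window of at most `(hi - lo + 1) + 2 * k * RADIUS` cells rather than all `n`. How the corrupted cell and the detection latency are actually determined is outside the scope of the sketch.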
“…Recently, we have successfully applied ABFR to stencil computations [4], which are perfectly suited to ABFR due to their regular and neighbor-based communication pattern. The tree-based propagation pattern of N-Body computations is much more challenging for ABFR.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…We propose to use the Algorithm-Based Focused Recovery (ABFR) approach [4] for N-body computations. ABFR exploits application semantics and versioned states to bound error impact and further localize recovery.…”
Section: Algorithm-based Focused Recovery (ABFR)
Citation type: mentioning
confidence: 99%
“…Techniques for detecting (and in some cases correcting) silent errors have been studied for a variety of iterative solvers, [31][32][33][34][35] adaptive numerical integrators, [36][37][38] and other widely used computational approaches [5, 39, 40]. Clearly, not all application outputs would be useful as health indicators, but our experience, informed by discussions with many colleagues, is that most computational scientists develop techniques to check their simulations to determine whether the results are believable. Many such checks could be automated and applied throughout a simulation using our approach.…”
Citation type: mentioning
confidence: 99%