2012 International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/sc.2012.36

Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems

Cited by 51 publications (32 citation statements)
References 22 publications
“…Since much is unsure about future fault tolerance solutions, there is active research into programming abstractions for resilience. For example, Chung et al [39] present containment domains: a programming construct that enables programmers to explicitly define the fault tolerance requirements for sections of code. Rolex [95] is a C/C++ language extension that incorporates resilience into application code.…”
Section: Programming Abstractions (mentioning)
confidence: 99%
“…We have not explored fully the semantics and applications of nested failure handling in FTA; however, when provided, it would allow users to handle a variety of failures, including silent data corruption detected in the application inside a code block. This would allow to apply concepts, such as containment domains [6], in MPI applications.…”
Section: Limitations and Future Work (mentioning)
confidence: 99%
“…These include projects such as MPICH-V [3], which combined lightweight checkpointing with message logging. Other new advances include Containment Domains [5], which allow the application to prevent errors in one part of the system from affecting others. Our work is in line with this last concept, presenting a heavily optimized solution for recovery on coprocessors.…”
Section: Related Work (mentioning)
confidence: 99%