2013
DOI: 10.1155/2013/473915

Containment Domains: A Scalable, Efficient and Flexible Resilience Scheme for Exascale Systems

Abstract: This paper describes and evaluates a scalable and efficient resilience scheme based on the concept of containment domains. Containment domains are a programming construct that enables applications to express resilience needs and to interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schemes. Containment domains have weak transactional semantics and are nested to take advantage of the machine and application hierarchies and to enable hierarchical stat…

Citations: Cited by 24 publications (22 citation statements)
References: References 23 publications
“…Formalizations targeting resilience can be found in [17,32]. Containment domains for encapsulating failures within a hierarchical scope are discussed in [13]. Modeling and prediction of failures is addressed in [8,13].…”
Section: Other Important Studies and Discussion (mentioning)
confidence: 99%
“…The protection domain of the pattern extends to the scope of the primary system, i.e., the scope for which the recovery block is created. Examples of the recovery block pattern in HPC include the Containment Domains (CD) [14] programming construct, which provides a recovery routine initiated upon detection of an error in the execution of the block of code encapsulated by the CD. This enables the CD to constrain the detection and correction of errors to the boundary of the domain.…”
Section: Recovery Block Pattern (mentioning)
confidence: 99%
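
The recovery-block behaviour described in the citation statement above (preserve state on entry, run the enclosed code, and on a detected error restore and re-execute within the domain) can be illustrated with a minimal Python sketch. This is not the Containment Domains API from the paper; `DetectedError`, `run_in_domain`, `make_flaky_body`, and the `max_retries` parameter are hypothetical names introduced only for illustration, and state is assumed to be a plain dictionary.

```python
import copy


class DetectedError(Exception):
    """Hypothetical signal that an error was detected inside a domain."""


def run_in_domain(state, body, max_retries=3):
    """Run body(state) inside a recovery-block-style domain (illustrative only).

    The state is preserved on entry. If the body raises DetectedError, the
    state is restored from the preserved copy and the body is re-executed.
    After max_retries failed re-executions the error is re-raised, so an
    enclosing (parent) domain can attempt its own recovery.
    Returns the number of re-executions that were needed.
    """
    preserved = copy.deepcopy(state)  # state preservation on entry
    for attempt in range(max_retries + 1):
        try:
            body(state)
            return attempt
        except DetectedError:
            # Restoration: discard partial updates made by the failed attempt.
            state.clear()
            state.update(copy.deepcopy(preserved))
    raise DetectedError("recovery failed; escalating to the enclosing domain")


def make_flaky_body(failures):
    """Build a body that updates the state, then fails `failures` times."""
    remaining = {"n": failures}

    def body(state):
        state["x"] += 1  # partial update that must be rolled back on error
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise DetectedError("simulated detected error")

    return body
```

For example, calling `run_in_domain({"x": 0}, make_flaky_body(1))` re-executes the body once after rolling back the partial update. Nesting would correspond to a body that itself calls `run_in_domain` on a subset of the work, so that an error the inner domain cannot recover from is handled by the outer one.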
“…In the case of memory, a DUE indicates that some data has been lost. This loss of data could be acknowledged and tolerated (by an error tolerant application), it may be corrected by some higher-level protection mechanism (such as checkpoint and restart or a hierarchical state preservation and restoration runtime system [26]), or it may indicate a fail-stop condition where forward progress is halted (but no silent data corruption occurs). Section 5 evaluates Bamboo ECC in the context of machines with high-level state recovery facilities, as many high-performance and high-availability machines have system-level recovery mechanisms in place.…”
Section: Reliability (mentioning)
confidence: 99%