2012 IEEE 26th International Parallel and Distributed Processing Symposium 2012
DOI: 10.1109/ipdps.2012.111

HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications

Abstract: High performance computing will probably reach exascale in this decade. At this scale, mean time between failures is expected to be a few hours. Existing fault tolerant protocols for message passing applications will not be efficient anymore since they either require a global restart after a failure (checkpointing protocols) or result in huge memory occupation (message logging). Hybrid fault tolerant protocols overcome these limits by dividing application processes into clusters and applying a differ…

Citations: cited by 34 publications (48 citation statements)
References: 26 publications
“…This induces an overhead, which we express as a slowdown of the execution rate: instead of executing one work-unit per second, the application executes only λ work-units, where 0 < λ < 1. Typical values for λ are said to be λ ≈ 0.98, meaning that the overhead due to payload messages is only a small percentage [36,13]. On the contrary, message logging has a positive effect on re-execution after a failure, because intergroup messages are stored in memory and directly accessible after the recovery.…”
Section: Refining the Model (mentioning)
confidence: 99%
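
The slowdown model quoted above can be made concrete with a short Python sketch. It assumes only what the statement says: without overhead the application executes one work-unit per second, and with logging overhead it executes λ work-units per second, where 0 < λ < 1. The function names are hypothetical and chosen for illustration.

```python
def time_with_logging_overhead(work_units: float, lam: float) -> float:
    """Seconds needed to finish `work_units` at a rate of `lam` work-units per second."""
    if not 0 < lam < 1:
        raise ValueError("lam must satisfy 0 < lam < 1")
    return work_units / lam


def overhead_fraction(lam: float) -> float:
    """Extra execution time relative to an overhead-free run (1 work-unit per second)."""
    if not 0 < lam < 1:
        raise ValueError("lam must satisfy 0 < lam < 1")
    return 1.0 / lam - 1.0
```

With the quoted value λ ≈ 0.98, `overhead_fraction(0.98)` is about 0.02, which matches the statement's description of the overhead as a small percentage.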
“…Typically, these groups are composed to take advantage of the application communication pattern [36,32]. For instance, if the application executes on a 2D-grid of processors, a natural way to create processor groups is to have one group per row (or column) of the grid.…”
Section: Case Studies (mentioning)
confidence: 99%
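
The row-based grouping described in this statement can be sketched as follows. The sketch assumes processor ranks are numbered in row-major order on the grid, which the statement does not specify; `row_groups` and `group_of` are hypothetical helper names.

```python
def row_groups(rows: int, cols: int) -> list[list[int]]:
    """One processor group per grid row, assuming row-major rank numbering."""
    return [[r * cols + c for c in range(cols)] for r in range(rows)]


def group_of(rank: int, cols: int) -> int:
    """Index of the row group that contains `rank` (row-major assumption)."""
    return rank // cols
```

Grouping by column instead would use `rank % cols` under the same numbering assumption.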
“…Several novel fault tolerance protocols overcome this limitation by reducing significantly the number of messages to log. They fall into the class of hierarchical fault tolerance protocols, forming clusters of processes and using coordinated checkpointing inside clusters and message logging between clusters [61,77,84]. Such protocols need to manage causal dependencies between processes in order to ensure correct recovery.…”
Section: Toward Exascale Resilience: 2014 Update (mentioning)
confidence: 99%
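
The reason hierarchical protocols log fewer messages can be illustrated with a minimal sketch. It shows only the rule stated above (log messages that cross a cluster boundary, rely on coordinated checkpointing inside a cluster) and is not the HydEE protocol or any cited protocol; the cluster assignment and all names are hypothetical.

```python
def must_log(sender: int, receiver: int, cluster_of: dict[int, int]) -> bool:
    """A message is logged only when it crosses a cluster boundary."""
    return cluster_of[sender] != cluster_of[receiver]


def count_logged(messages: list[tuple[int, int]], cluster_of: dict[int, int]) -> int:
    """Number of (sender, receiver) messages that would be logged."""
    return sum(must_log(s, r, cluster_of) for s, r in messages)


# Hypothetical example: four processes split into two clusters.
clusters = {0: 0, 1: 0, 2: 1, 3: 1}
traffic = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 0)]
logged = count_logged(traffic, clusters)  # only (1, 2) and (3, 0) cross clusters
```

The more the communication pattern stays inside clusters, the smaller the logged fraction, which is why cluster composition usually follows the application's communication pattern.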
“…Section 0.7 is mainly based on [12], which contains references to many studies of checkpointing policies. For non-coordinated checkpointing protocols, [28] by Guermouche et al. is a good starting point.…”
Section: Further Information (mentioning)
confidence: 99%