2015
DOI: 10.1007/978-3-319-20943-2_4

Using Replication for Resilience on Exascale Systems

Abstract: High-performance computing applications must tolerate faults, which are common, especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-rollback, in which the application saves its state to secondary storage throughout execution and recovers from the latest saved state after a failure. An oft-studied research question is that of the optimal checkpointing strategy: when should checkpoints be saved? Unfortunately, even using an optimal checkpointing stra…

Cited by 11 publications (11 citation statements)
References 33 publications
“…We use the term core to represent the computing resource allocation unit [28]. We further use P(σ, w, t) to denote a process executing at rate σ to complete a workload w by time t. The basic tenet of Lazy Shadowing is the concept of shadowing, whereby each process is associated with a lazy replica.…”
Section: Lazy Shadowing (mentioning)
confidence: 99%
“…The impact of process replication on MNFTI has been studied in [28]. Our problem is equivalent to that, with the difference that our work can tolerate one failure in each shadowed set while [28] can tolerate one failure in each replica-group of size 2.…”
Section: Application Failure Probability (mentioning)
confidence: 99%
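The failure-tolerance rule quoted above (an application with replica-groups of size 2 survives as long as no group loses both of its replicas) can be illustrated with a minimal sketch. This is not code from the cited work: the function name `failures_to_interruption`, the process numbering, and the rule that process `p` belongs to group `p % num_groups` are assumptions made only for this illustration.

```python
from typing import Iterable, Optional


def failures_to_interruption(num_groups: int,
                             failure_sequence: Iterable[int]) -> Optional[int]:
    """Count failures until some replica-group of size 2 loses both replicas.

    Hypothetical numbering for this sketch: processes are 0 .. 2*num_groups - 1
    and process p belongs to group p % num_groups.

    Returns the 1-based position of the failure that interrupts the
    application, or None if the sequence never interrupts it.
    """
    # For each group, remember which of its two replicas have failed so far.
    dead_in_group = [set() for _ in range(num_groups)]

    for count, proc in enumerate(failure_sequence, start=1):
        if not 0 <= proc < 2 * num_groups:
            raise ValueError(f"process id {proc} out of range")
        group = proc % num_groups
        # A repeated hit on an already-failed replica is counted as a failure
        # but does not change the group's state (the set ignores duplicates).
        dead_in_group[group].add(proc)
        if len(dead_in_group[group]) == 2:
            return count  # both replicas of this group are gone

    return None  # every group still has at least one live replica
```

For example, with three groups the sequence `[0, 1, 2]` hits one replica in each group and the application survives, whereas `[0, 1, 3]` interrupts it at the third failure because processes 0 and 3 share a group under the assumed numbering.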
“…Replication remains the most transparent and least intrusive technique and can be used at different levels (duplication, triplication or even more). Combined with checkpointing, replication comes with two flavors: process replication [24,25] and group replication [26]. Process replication applies to message-passing applications with communicating processes.…”
Section: Related Work (mentioning)
confidence: 99%
“…When transparent replication is not (yet) provided by the runtime system, one solution could be to implement it explicitly within the application, but this is a labor-intensive process especially for legacy applications. Another approach introduced in [15] is group replication, a technique that can be used whenever process replication is not available. Group replication is agnostic to the parallel programming model, and thus views the application as an unmodified black box.…”
Section: Introduction (mentioning)
confidence: 99%