Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale (2017)
DOI: 10.1145/3086157.3086165

Optimal Checkpointing Period with Replicated Execution on Heterogeneous Platforms

Abstract: In this paper, we design and analyze strategies to replicate the execution of an application on two different platforms subject to failures, using checkpointing on a shared stable storage. We derive the optimal pattern size W for a periodic checkpointing strategy where both platforms concurrently try and execute W units of work before checkpointing. The first platform that completes its pattern takes a checkpoint, and the other platform interrupts its execution to synchronize from that checkpoint. We compare this stra…
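
The pattern described in the abstract can be explored numerically. The following is a minimal Monte Carlo sketch, not the paper's analytical derivation. It rests on assumptions that do not come from the abstract: failures are exponentially distributed, a platform that fails loses its progress in the current pattern and restarts it after a fixed recovery time, and the slower platform synchronizes from the checkpoint at no extra cost. All function and parameter names (`time_to_finish`, `speeds`, `fail_rates`, `checkpoint`, `recovery`) are hypothetical.

```python
import random


def time_to_finish(work_time, fail_rate, recovery, rng):
    """Wall-clock time for one platform to complete `work_time` of
    failure-free work, restarting from scratch after each failure.
    Assumes exponentially distributed failures (an assumption)."""
    elapsed = 0.0
    while True:
        next_failure = rng.expovariate(fail_rate)
        if next_failure >= work_time:
            # No failure before the pattern completes.
            return elapsed + work_time
        # Work done so far is lost; pay the recovery cost and retry.
        elapsed += next_failure + recovery


def pattern_time(W, speeds, fail_rates, checkpoint, recovery, rng):
    """Time for one pattern: the first platform to finish W units of work
    takes the checkpoint; the other is assumed to synchronize for free."""
    finish = [
        time_to_finish(W / s, lam, recovery, rng)
        for s, lam in zip(speeds, fail_rates)
    ]
    return min(finish) + checkpoint


def mean_time_per_work_unit(W, speeds, fail_rates, checkpoint, recovery,
                            trials=2000, seed=0):
    """Monte Carlo estimate of the expected time spent per unit of work."""
    rng = random.Random(seed)
    total = sum(
        pattern_time(W, speeds, fail_rates, checkpoint, recovery, rng)
        for _ in range(trials)
    )
    return total / (trials * W)


def best_pattern_size(candidates, speeds, fail_rates, checkpoint, recovery):
    """Brute-force search: return the candidate W with the lowest estimate."""
    return min(
        candidates,
        key=lambda W: mean_time_per_work_unit(
            W, speeds, fail_rates, checkpoint, recovery
        ),
    )


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    W_best = best_pattern_size(
        candidates=[5, 10, 20, 40, 80],
        speeds=(1.0, 2.0),
        fail_rates=(0.01, 0.02),
        checkpoint=1.0,
        recovery=0.5,
    )
    print("best candidate W:", W_best)
```

A small W pays the checkpoint cost too often, while a large W loses more work per failure, so the estimate is expected to have an interior minimum. The brute-force search only approximates the optimal W that the paper derives.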

Cited by 4 publications (1 citation statement)
References 20 publications
“…As each of the basic redundancies has its own drawbacks and merits, some researchers have attempted to obtain the appropriate combination of those redundancies. In , based on combining replication and resubmission redundancies, the Resubmission Impact (RI) heuristic is proposed.…”
Section: Related Studies (mentioning)
Confidence: 99%