2013
DOI: 10.1109/tc.2012.57

Complexity Analysis of Checkpoint Scheduling with Variable Costs

Abstract: The parallel computing platforms available today are increasingly large and thus increasingly subject to failures. Consequently, it is necessary to develop efficient strategies that provide safe and reliable completion for HPC parallel applications. Checkpointing is one of the most popular and efficient techniques for developing fault-tolerant applications in such a context. However, checkpoint operations are costly in terms of time, computation, and network communication. This will certainly …

Cited by 18 publications (21 citation statements)
References: 31 publications
“…Checkpoint-rollback-recovery is used to tolerate failures. Our main contribution over previous work [13,19] is that we consider general Directed Acyclic Graphs instead of linear chains. Our theoretical results include polynomial-time algorithms for fork DAGs and for some join DAGs (when the checkpoint and recovery costs are constant) and the intractability of the problem for join DAGs in general.…”
Section: Results
confidence: 99%
“…Few authors have studied the resilience problem with workflows when checkpointing can only take place at the end of each task. Bouguerra et al. [19] have studied a restricted version of DAGChkptSched when the workflow is a linear chain (with a single processor). They propose a greedy heuristic to minimize the total execution time in case of arbitrary failures.…”
Section: Related Work
confidence: 99%
“…Middleware checkpoint management: As seen for VM resilient operation, middleware process resiliency can also be enhanced using checkpointing. The problem to obtain optimal scheduling for checkpoint of multiple components and layers is complex (proven to be NP-hard in [189]), because checkpoint implementation might differ based on the component diversity. This is particularly challenging in large cloud infrastructures due to synchronization, upgrade, and resource management issues.…”
Section: E. Resiliency in Cloud Middleware Infrastructure
confidence: 99%
“…The checkpointing scheduling complexity has been analyzed in [3]. In this research, no assumption was made regarding failures distribution, and checkpointing overhead was assumed to be variable.…”
Section: Related Work
confidence: 99%