Complexity Analysis of Checkpoint Scheduling with Variable Costs

Bouguerra, Mohamed-Slim; Trystram, Denis; Wagner, Frederic H.

doi:10.1109/tc.2012.57

Cited by 18 publications

(21 citation statements)

References 31 publications


“…Checkpoint-rollback-recovery is used to tolerate failures. Our main contribution over previous work [13,19] is that we consider general Directed Acyclic Graphs instead of linear chains. Our theoretical results include polynomial-time algorithms for fork DAGs and for some join DAGs (when the checkpoint and recovery costs are constant) and the intractability of the problem for join DAGs in general.…”

Section: Results (mentioning)

confidence: 99%

“…Few authors have studied the resilience problem with workflows when checkpointing can only take place at the end of each task. Bouguerra et al [19] have studied a restricted version of DAGChkptSched when the workflow is a linear chain (with a single processor). They propose a greedy heuristic to minimize the total execution time in case of arbitrary failures.…”

Section: Related Work (mentioning)

confidence: 99%


Checkpointing Strategies for Scheduling Computational Workflows

Aupy

Benoît

Casanova

et al. 2016

IJNC


We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery is used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and deciding for each task whether to checkpoint it or not after it completes. We give a polynomial-time optimal algorithm for fork DAGs (Directed Acyclic Graphs) and show that the problem is NP-complete with join DAGs. We also investigate the complexity of the simple case in which no task is checkpointed. Our main result is a polynomial-time algorithm to compute the expected execution time of a workflow, with a given task execution order and specified to-be-checkpointed tasks. Using this algorithm as a basis, we propose several heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations.


“…Middleware checkpoint management: As seen for VM resilient operation, middleware process resiliency can also be enhanced using checkpointing. The problem to obtain optimal scheduling for checkpoint of multiple components and layers is complex (proven to be NP-hard in [189]), because checkpoint implementation might differ based on the component diversity. This is particularly challenging in large cloud infrastructures due to synchronization, upgrade, and resource management issues.…”

Section: E Resiliency In Cloud Middleware Infrastructure (mentioning)

confidence: 99%

A Survey on Resiliency Techniques in Cloud Computing Infrastructures and Applications

Colman-Meixner

Develder

Tornatore

et al. 2016

IEEE Commun. Surv. Tutorials



Abstract: Today's businesses increasingly rely on cloud computing, which brings both great opportunities and challenges. One of the critical challenges is resiliency: disruptions due to failures (either accidental or because of disasters or attacks) may entail significant revenue losses (e.g., US$ 25.5 billion in 2010 for North America). Such failures may originate at any of the major components in a cloud architecture (and propagate to others): (i) the servers hosting the application, (ii) the network interconnecting them (on different scales, inside a data center, up to wide-area connections), or (iii) the application itself. We comprehensively survey a large body of work focusing on resilience of cloud computing, in each (or a combination) of the server, network, and application components. First, we present the cloud computing architecture and its key concepts. We highlight both the infrastructure (servers, network) and application components. A key concept is virtualization of infrastructure (i.e., partitioning into logically separate units), and thus we detail the components in both physical and virtual layers. Before moving to the detailed resilience aspects, we provide a qualitative overview of the types of failures that may occur (from the perspective of the layered cloud architecture), and their consequences. The second major part of the paper introduces and categorizes a large number of techniques for cloud computing infrastructure resiliency. This ranges from designing and operating the facilities, servers, networks, to their integration and virtualization (e.g., also including resilience of the middleware infrastructure). The third part focuses on resilience in application design and development. We study how applications are designed, installed, and replicated to survive multiple physical failure scenarios as well as disaster failures.


“…The checkpointing scheduling complexity has been analyzed in [3]. In this research, no assumption was made regarding failures distribution, and checkpointing overhead was assumed to be variable.…”

Section: Related Work (mentioning)

confidence: 99%

On the Optimum Checkpointing Interval Selection for Variable Size Checkpoint Dumps

Sadi

Yagoubi

2015

IFIP Advances in Information and Communication Technology


Abstract. Checkpointing is a technique that is often employed for granting fault tolerance for applications executing in failure-prone environments. It consists in regularly saving the application's state to another, fault-independent storage such that if the application fails, it can be continued without necessarily restarting it. In this context, fixing the checkpointing frequency is an important topic which we address in this paper. We particularly address this issue considering hybrid fault tolerance and variable size checkpoint dumps. We then evaluate our solution and compare it with state of the art models, and show that our solution brings better results.

