A first order approximation to the optimum checkpoint interval

Young, John W.

doi:10.1145/361147.361115

Cited by 549 publications

(437 citation statements)

References 0 publications

Supporting

Mentioning

424

Contrasting

Unclassified


“…Let the time interval between checkpoints be $T_c$, the time to save checkpoint information be $T_s$, and the mean time between failures (MTBF) be $T_f$. Then, the optimal checkpoint rate is $T_c = \sqrt{2 \times T_s \times T_f}$ [53]. We also observed that the mean checkpoint time ($T_s$) for BT, CG, FT, LU and SP with class C inputs on 4, 8 or 9 and 16 nodes is 23 seconds on the same experimental cluster [51].…”

Section: G Proactive Ft Complements Reactive Ft (mentioning)

confidence: 59%

Proactive process-level live migration and back migration in HPC environments

Wang

Mueller

Engelmann

et al. 2012

Journal of Parallel and Distributed Computing


As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when one's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
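The interval formula quoted in this entry's citation statement, Young's first-order result T_c = sqrt(2 × T_s × T_f), can be sketched in a few lines of Python. This is only an illustration, not code from either paper: the function name `young_interval`, the 24-hour MTBF, and the way proactive handling is modelled (handled faults are simply removed from the failure rate, so the effective MTBF becomes T_f / (1 - p)) are all assumptions made here. Only the 23-second checkpoint time and the 70% figure come from the text above.

```python
import math


def young_interval(t_save: float, mtbf: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * T_s * T_f).

    Both arguments and the result use the same time unit (seconds here).
    """
    if t_save <= 0 or mtbf <= 0:
        raise ValueError("t_save and mtbf must be positive")
    return math.sqrt(2.0 * t_save * mtbf)


T_S = 23.0                  # mean checkpoint time in seconds (quoted above)
T_F = 24.0 * 3600.0         # hypothetical MTBF of 24 hours, NOT from the source

base = young_interval(T_S, T_F)

# Assumption: a fraction p of faults is handled proactively and never forces
# a rollback, so only (1 - p) of failures count toward the reactive MTBF.
p = 0.70
proactive = young_interval(T_S, T_F / (1.0 - p))

# Checkpoints taken per unit time scale as 1 / interval.
checkpoint_ratio = base / proactive

print(f"interval without proactive FT: {base:.0f} s")
print(f"interval with proactive FT:    {proactive:.0f} s")
print(f"relative number of checkpoints: {checkpoint_ratio:.2f}")
```

Under this simplified model the ratio depends only on p: it equals sqrt(1 - p), roughly 0.55 for p = 0.70, which is in line with the abstract's statement that the number of checkpoints is nearly cut in half.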


“…We derive that $C_{i,j}$ = mi jτ + β. As for the checkpointing period $\tau_{i,j}$, we use Young's formula [17] and let…”

Section: Fault Model (mentioning)

confidence: 99%

Resilient co-scheduling of malleable applications

Benoît

Pottier

Robert

2017

The International Journal of High Performance Computing Applications


Recently, the benefits of co-scheduling several applications have been demonstrated in a fault-free context, both in terms of performance and energy savings. However, large-scale computer systems are confronted by frequent failures, and resilience techniques must be employed for large applications to execute efficiently. Indeed, failures may create severe imbalance between applications, and significantly degrade performance. In this paper, we aim at minimizing the expected completion time of a set of co-scheduled applications. We propose to redistribute the resources assigned to each application upon the occurrence of failures, and upon the completion of some applications, in order to achieve this goal. First, we introduce a formal model and establish complexity results. The problem is NP-complete for malleable applications, even in a fault-free context. Therefore, we design polynomial-time heuristics that perform redistributions and account for processor failures. A fault simulator is used to perform extensive simulations that demonstrate the usefulness of redistribution and the performance of the proposed heuristics.


“…Many models are available to understand the behavior of checkpoint/restart [19,20,21,22], and thereby to define an optimal checkpoint period. [23] proposes a scalability model to evaluate the impact of failures on application performance.…”

Section: Related Work (mentioning)

confidence: 99%

Composing resilience techniques: ABFT, periodic and incremental checkpointing

Bosilca

Bouteiller

Hérault

et al. 2015

IJNC


Algorithm Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. Thanks to recent advances in the understanding of the involved mechanisms, a growing number of important algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger applications, these algorithms provide a temporal section of the execution, where the data is protected by its own intrinsic properties, and can therefore be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only practical fault-tolerance approach for these applications is checkpoint/restart. In this paper we propose a model to investigate the efficiency of a composite protocol that alternates between ABFT and checkpoint/restart for the effective protection of an iterative application composed of ABFT-aware and ABFT-unaware sections. We also consider an incremental checkpointing composite approach in which the algorithmic knowledge is leveraged by a novel optimal dynamic programming to compute checkpoint dates. We validate these models using a simulator. The model and simulator show that the composite approach drastically increases the performance delivered by an execution platform, especially at scale, by providing the means to increase the interval between checkpoints while simultaneously decreasing the volume of each checkpoint.
