Checkpointing strategies for parallel jobs

Bougeret, Marin; Casanova, Henri; Rabie, Mikaël; Robert, Yves; Vivien, Frédéric

doi:10.1145/2063384.2063428

Cited by 74 publications

(88 citation statements)

References 24 publications

“…Therefore, in the optimal case, the number of checkpoints equals the number of failures, which equals the number of recoveries. There are various works that define optimal checkpoint intervals [28], [29]. Finally, we assume that checkpoint commit is synchronous; that is, the primary application process is paused during the commit operation and is not resumed until checkpoint commit is complete.…”

Section: A Checkpoint Compression Viability Model (mentioning)

confidence: 99%

On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-Based Fault Tolerance

Ibtesham, Arnold, Bridges, et al. 2012

2012 41st International Conference on Parallel Processing

Abstract: The increasing size and complexity of high performance computing (HPC) systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future generation systems. Therefore, optimizations that reduce checkpoint overheads are necessary to keep checkpoint/restart mechanisms effective. In this work, we demonstrate that checkpoint data compression is a feasible mechanism for reducing checkpoint commit latency and storage overheads. Leveraging a simple model for checkpoint compression viability, we show: (1) checkpoint data compression is feasible for many types of scientific applications expected to run on extreme scale systems; (2) checkpoint compression viability scales with checkpoint size; (3) user-level versus system-level checkpoints bears little impact on checkpoint compression viability; and (4) checkpoint compression viability scales with application process count. Lastly, we describe the impact checkpoint compression might have on projected extreme scale systems.

“…The first step is to generate a fault distribution: we use an existing fault simulator developed in [21,22]. In our case, we use this simulator with an exponential law of parameter λ.…”

Section: Simulation Settings (mentioning)

confidence: 99%

Resilient co-scheduling of malleable applications

Benoît, Pottier, Robert 2017

The International Journal of High Performance Computing Applica

Self Cite

Recently, the benefits of co-scheduling several applications have been demonstrated in a fault-free context, both in terms of performance and energy savings. However, large-scale computer systems are confronted by frequent failures, and resilience techniques must be employed for large applications to execute efficiently. Indeed, failures may create severe imbalance between applications, and significantly degrade performance. In this paper, we aim at minimizing the expected completion time of a set of co-scheduled applications. We propose to redistribute the resources assigned to each application upon the occurrence of failures, and upon the completion of some applications, in order to achieve this goal. First, we introduce a formal model and establish complexity results. The problem is NP-complete for malleable applications, even in a fault-free context. Therefore, we design polynomial-time heuristics that perform redistributions and account for processor failures. A fault simulator is used to perform extensive simulations that demonstrate the usefulness of redistribution and the performance of the proposed heuristics.
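
The citation statement above mentions generating a fault distribution from an exponential law of parameter λ. As a rough illustration only, the sketch below draws failure dates with exponentially distributed inter-arrival times. It is not the simulator cited as [21,22]; the function name `generate_failure_dates` and its parameters `lam`, `horizon`, and `seed` are hypothetical and chosen for this example.

```python
import random


def generate_failure_dates(lam, horizon, seed=None):
    """Return increasing failure dates in [0, horizon).

    Assumption for this sketch: inter-arrival times are independent
    and exponentially distributed with rate `lam` (failures per unit
    of time), so the mean time between failures is 1 / lam.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    rng = random.Random(seed)  # private generator, reproducible with a seed
    dates = []
    t = rng.expovariate(lam)   # date of the first failure
    while t < horizon:
        dates.append(t)
        t += rng.expovariate(lam)  # add the next inter-arrival time
    return dates


if __name__ == "__main__":
    # Hypothetical setting: one failure per hour on average, over one day.
    dates = generate_failure_dates(lam=1.0 / 3600, horizon=24 * 3600, seed=42)
    print(len(dates), "failures drawn")
```

Passing a fixed `seed` makes a run repeatable, which helps when comparing heuristics on the same failure trace.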

“…Many models are available to understand the behavior of checkpoint/restart [19,20,21,22], and thereby to define an optimal checkpoint period. [23] proposes a scalability model to evaluate the impact of failures on application performance.…”

Section: Related Work (mentioning)

confidence: 99%

“…, $t_j$, and to checkpoint after $t_j$, without any intermediate checkpoint, and knowing that a checkpoint has been taken after task $t_{i-1}$. To the best of our knowledge, the expectation $E(W, C)$ of the time needed to successfully compute during $W$ seconds and then take a checkpoint of duration $C$ is known only for Exponentially distributed failures; from [22], we know that:…”

Section: Optimal Incremental Checkpointing Strategy (mentioning)

confidence: 99%

Composing resilience techniques: ABFT, periodic and incremental checkpointing

Bosilca, Bouteiller, Hérault, et al. 2015

IJNC

Self Cite

Algorithm Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. Thanks to recent advances in the understanding of the involved mechanisms, a growing number of important algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger applications, these algorithms provide a temporal section of the execution, where the data is protected by its own intrinsic properties, and can therefore be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only practical fault-tolerance approach for these applications is checkpoint/restart. In this paper we propose a model to investigate the efficiency of a composite protocol, that alternates between ABFT and checkpoint/restart for the effective protection of an iterative application composed of ABFT-aware and ABFT-unaware sections. We also consider an incremental checkpointing composite approach in which the algorithmic knowledge is leveraged by a novel optimal dynamic programming to compute checkpoint dates. We validate these models using a simulator. The model and simulator show that the composite approach drastically increases the performance delivered by an execution platform, especially at scale, by providing the means to increase the interval between checkpoints while simultaneously decreasing the volume of each checkpoint.
