On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-Based Fault Tolerance

Ibtesham, Dewan; Arnold, Dorian; Bridges, Patrick G.; Ferreira, Kurt Brian; Brightwell, Ron

doi:10.1109/icpp.2012.45

Cited by 40 publications

(26 citation statements)

References 26 publications


“…NVRAM [21,8]), methods which decrease the time to write each individual checkpoint (e.g. incremental checkpointing [3,37,1,12], multi-level checkpointing [44,31,27], remote checkpointing [42,45], and checkpoint compression [17]), and methods that decrease the number of checkpoints that must be taken per unit time (e.g. replication [10]).…”

Section: Related Work (mentioning)

confidence: 99%

Evaluating energy savings for checkpoint/restart

Mills, Grant, Ferreira, et al. 2013

Proceedings of the 1st International Workshop on Energy Efficient Supercomputing

Self Cite

“…In previous work, we developed a checkpoint compression viability model based on compression factor, compression speed and I/O bandwidth that outputs when checkpoint data compression yields performance improvements [1]. We evaluated the impact of checkpoint compression on overall application performance using an extension of Daly's model.…”

Section: Why GPU-Based Checkpoint Compression? (mentioning)

confidence: 99%

Abstract: Comparing GPU and Increment-Based Checkpoint Compression

Ibtesham, Arnold, Ferreira, et al. 2012

2012 SC Companion: High Performance Computing, Networking Storage and Analysis


I. WHY GPU-BASED CHECKPOINT COMPRESSION?

Checkpoint/restart protocols periodically record the address space state of all processes in an application execution instance to stable storage. Upon failures, new incarnations of failed processes are recovered from the failed processes' most recent checkpoints. Various strategies have been explored for improving checkpoint/restart efficiency, including strategies that hide or reduce checkpoint commit latencies, for example by reducing checkpoint sizes. One such optimization, increment-based checkpointing, only saves the incremental changes in a process's address space between subsequent checkpoints.

In previous work, we developed a checkpoint compression viability model based on compression factor, compression speed and I/O bandwidth that outputs when checkpoint data compression yields performance improvements [1]. We evaluated the impact of checkpoint compression on overall application performance using an extension of Daly's model. This evaluation was based on CPU-based checkpoint compression performance and demonstrated that checkpoint data compression can improve an application makespan significantly. Now, we compare compression-based and increment-based optimizations and begin to explore how GPU-based checkpoint compression might further improve checkpoint/restart performance. Questions we wish to answer include:

• How do compression-based and increment-based checkpoint optimizations compare?
• Does the combination of compression-based and increment-based optimizations yield further improvements?
• Can faster, GPU-based compression algorithms improve checkpoint compression viability and, as a result, improve application makespan?

II. METHODOLOGY

We collected checkpoint compression performance data using the following setup¹:

• Applications: We performed our experiments with a set of mini apps from the Mantevo Project, namely HPCCG, pHPCCG, phdMesh and miniFE, along with LAMMPS, a key simulation workload for the Department of Energy.
• Checkpoint Libraries: We used BLCR as our system-level checkpoint library to generate checkpoints at small intervals uniformly distributed over the application runs. We also used LAMMPS' capability of generating checkpoints and generated checkpoints using the built-in checkpoint library.
• Compression Utilities: We chose popular compression tools from Linux's software stack, for example parallel bzip, bzip, zip, rzip and 7zip, and a parallel CUDA-based compression algorithm, GFC [2], as our GPU-based compression routine.

¹ For detailed references about our experimental setup, we refer to our previous study [1].

We fed the collected data into our application efficiency model, which now includes increment-based checkpointing. The modified model takes two additional parameters: the number of increment-based checkpoints between two full checkpoints and the ratio between the size of an increment-based checkpoint and a full checkpoint. We assume each checkpoint increment is 1/5th the size of a regular checkpoint, an optimal number of increments between check...
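The viability idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes that compression factor means the fraction of checkpoint size removed (1 - compressed/uncompressed), that compression and writing are not overlapped, and that increments and full checkpoints are simply averaged. The function names and example numbers are hypothetical.

```python
def compression_is_viable(size_bytes, compression_factor,
                          compression_speed, io_bandwidth):
    """Return True when compress-then-write beats writing the raw checkpoint.

    Assumptions (not taken from the paper):
      - compression_factor = 1 - compressed_size / uncompressed_size
      - compression_speed and io_bandwidth are in bytes per second
      - compression and I/O happen one after the other, with no overlap
    """
    time_plain = size_bytes / io_bandwidth
    time_compress = size_bytes / compression_speed
    time_write_compressed = (1.0 - compression_factor) * size_bytes / io_bandwidth
    return time_compress + time_write_compressed < time_plain


def mean_checkpoint_fraction(increments_between_full, increment_ratio):
    """Average checkpoint size as a fraction of a full checkpoint.

    One full checkpoint is followed by `increments_between_full` increments,
    each `increment_ratio` times the size of a full checkpoint.
    """
    total = 1.0 + increments_between_full * increment_ratio
    return total / (1 + increments_between_full)


if __name__ == "__main__":
    # Hypothetical numbers: 1 GB checkpoint, 50% size reduction,
    # 400 MB/s compressor, 100 MB/s per-process I/O bandwidth.
    print(compression_is_viable(1e9, 0.5, 400e6, 100e6))
    # Increments at 1/5 the size of a full checkpoint, 4 per full checkpoint.
    print(mean_checkpoint_fraction(4, 0.2))
```

Under these assumptions, compression pays off only when the compressor is fast enough relative to the I/O bandwidth it relieves, which is why a faster GPU-based compressor could widen the range of viable configurations.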


“…In the past, a number of technologies have been presented to improve fault tolerance (FT) of large-scale systems, and new resilience techniques are emerging to address new challenges posed by extreme-scale computing [5], [6], [7], [8], [9]. The advancement of resilience technologies, however, greatly depends on a deeper understanding of faults arising from hardware/software components.…”

Section: Introduction (mentioning)

confidence: 99%

Exploring void search for fault detection on extreme scale systems

Berrocal, Wallace, Papka, et al. 2014

2014 IEEE International Conference on Cluster Computing (CLUSTER)


Mean Time Between Failures (MTBF), now calculated in days or hours, is expected to drop to minutes on exascale machines. The advancement of resilience technologies greatly depends on a deeper understanding of faults arising from hardware and software components. This understanding has the potential to help us build better fault tolerance technologies. For instance, it has been proved that combining checkpointing and failure prediction leads to longer checkpoint intervals, which in turn leads to fewer total checkpoints. In this paper we present a new approach for fault detection based on the Void Search (VS) algorithm. VS is used primarily in astrophysics for finding areas of space that have a very low density of galaxies. We evaluate our algorithm using real environmental logs from Mira Blue Gene/Q supercomputer at Argonne National Laboratory. Our experiments show that our approach can detect almost all faults (i.e., sensitivity close to 1) with a low false positive rate (i.e., specificity values above 0.7). We also compare our algorithm with a number of existing detection algorithms, and find that ours outperforms all of them.
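For readers unfamiliar with the two metrics quoted in this abstract, the following sketch shows the standard definitions of sensitivity and specificity computed from detection counts. The counts are hypothetical and are not results from the paper.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real faults that the detector flagged."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of fault-free samples that the detector left unflagged."""
    return true_negatives / (true_negatives + false_positives)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    sens = sensitivity(true_positives=98, false_negatives=2)
    spec = specificity(true_negatives=75, false_positives=25)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
    # The false positive rate is the complement of specificity.
    print(f"false positive rate={1 - spec:.2f}")
```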
