2012 41st International Conference on Parallel Processing
DOI: 10.1109/icpp.2012.45

On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-Based Fault Tolerance

Abstract: The increasing size and complexity of high performance computing (HPC) systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future generation systems. Therefore, optimizations that reduce checkpoint overheads are necessary to keep checkpoint/restart mechanisms effective. In this work, we demonstrate that checkpoint data compressi…

citations
Cited by 40 publications
(26 citation statements)
references
References 26 publications
“…NVRAM [21,8]), methods which decrease the time to write each individual checkpoint (e.g. incremental checkpointing [3,37,1,12], multi-level checkpointing [44,31,27], remote checkpointing [42,45], and checkpoint compression [17]), and methods that decrease the number of checkpoints that must be taken per unit time (e.g. replication [10]).…”
Section: Related Work (mentioning)
confidence: 99%
“…In previous work, we developed a checkpoint compression viability model based on compression factor, compression speed and I/O bandwidth that outputs when checkpoint data compression yields performance improvements [1]. We evaluated the impact of checkpoint compression on overall application performance using an extension of Daly's model.…”
Section: Why GPU-based Checkpoint Compression? (mentioning)
confidence: 99%
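
The statement above names three inputs to the viability model: compression factor, compression speed, and I/O bandwidth. The following is a minimal sketch of how such a test could look, not the authors' published model. It assumes that compression factor means the fraction of bytes removed (1 - compressed size / original size), that compression and writing happen one after the other with no overlap, and that both rates use the same units. The function names are hypothetical.

```python
def checkpoint_time(size_bytes, io_bandwidth,
                    compression_factor=0.0, compression_speed=None):
    """Seconds to commit one checkpoint.

    Assumption: compression (if any) finishes before the write starts,
    so the two costs simply add up.
    """
    if compression_speed is None:
        # No compression: write the full checkpoint.
        return size_bytes / io_bandwidth
    compress_time = size_bytes / compression_speed
    write_time = size_bytes * (1.0 - compression_factor) / io_bandwidth
    return compress_time + write_time


def compression_is_viable(compression_factor, compression_speed, io_bandwidth):
    """True when compressing first is faster than writing uncompressed.

    Derived from: S/speed + (1 - factor) * S/bandwidth < S/bandwidth,
    which simplifies to speed * factor > bandwidth.
    """
    return compression_speed * compression_factor > io_bandwidth
```

Under these assumptions, a compressor that removes half the data must run at more than twice the I/O bandwidth to pay off. For example, with a 100 MB/s write path, a 400 MB/s compressor at factor 0.5 is viable, while a 150 MB/s compressor at the same factor is not.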
“…In the past, a number of technologies have been presented to improve fault tolerance (FT) of large-scale systems, and new resilience techniques are emerging to address new challenges posed by extreme-scale computing [5], [6], [7], [8], [9]. The advancement of resilience technologies, however, greatly depends on a deeper understanding of faults arising from hardware/software components.…”
Section: Introduction (mentioning)
confidence: 99%