2012 International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/sc.2012.77
MCREngine: A scalable checkpointing system using data-aware aggregation and compression

Abstract: High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, MCRENGINE.…

Cited by 38 publications (5 citation statements)
References 27 publications
“…Incremental checkpointing dynamically identifies the changed blocks of memory since the last checkpoint through a hash function [12] in order to limit the amount of state required to be captured per checkpoint. Data aggregation and compression also help reduce the bandwidth requirements when committing the checkpoint to disk [104]. Plank et al eliminate the overhead of writing checkpoints to disk altogether with a diskless in-memory checkpointing approach [150].…”
Section: Operating System and Runtime-based Solutions
confidence: 99%
“…Finally, compression-based techniques use standard compression algorithms to reduce checkpoint volumes [29] and can be used at the compiler-level [30] or in-memory [31]. Related, Plank et al proposed differential compression to reduce checkpoint sizes for incremental checkpoints [32] and Tanzima et al show that similarities amongst checkpoint data from different processes can be exploited to compress and reduce checkpoint data volumes [33]. Finally, Sasaki et al propose a lossy compression method based on wavelet transform and vector quantization to the checkpoints of a production climate application [34], while Ni et al [35] study the trade-offs between the loss of precision, compression ratio, and application correctness due to lossy compression.…”
Section: Related Work
confidence: 99%