2013
DOI: 10.1155/2013/341672

McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression

Abstract: High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEng…

Cited by 25 publications (24 citation statements)
References 23 publications
“…Islam et al. [39] present a checkpoint-restart library that works to coalesce requests from the multiple processes to the PFS. They use information (variables name and type) to place similar data close together.…”
Section: Requests Aggregation and Reordering (mentioning)
confidence: 99%
“…If an algorithm lacks a simple checking method or invariant, the Checker can be provided through comparison with a checksum over the data that was computed beforehand and stored in a safe region. 2 The Recover method can be supplied through the forward recovery phase in ABFT methods, or simply by restoring a light-weight deduplicated [1] or compressed [17] checkpoint of the data.…”
Section: Assumptions (mentioning)
confidence: 99%
“…In [28], it showed that data compression had the potential to significantly reduce the checkpointing file sizes. If multiple applications run concurrently, a data-aware compression scheme [29] was proposed to improve the overall checkpointing efficiency. Recent study [30] shows that combining failure detection and proactive checkpointing could improve 30% efficiency compared to classical periodical checkpointing.…”
Section: Related Work (mentioning)
confidence: 99%