2012 International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/sc.2012.77
MCREngine: A scalable checkpointing system using data-aware aggregation and compression

Abstract: High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, MCRENGINE.…

Cited by 38 publications (5 citation statements)
References 27 publications
“…Incremental checkpointing dynamically identifies the changed blocks of memory since the last checkpoint through a hash function [12] in order to limit the amount of state required to be captured per checkpoint. Data aggregation and compression also help reduce the bandwidth requirements when committing the checkpoint to disk [104]. Plank et al eliminate the overhead of writing checkpoints to disk altogether with a diskless in-memory checkpointing approach [150].…”
Section: Operating System and Runtime-based Solutions
confidence: 99%
“…Finally, compression-based techniques use standard compression algorithms to reduce checkpoint volumes [29] and can be used at the compiler-level [30] or in-memory [31]. Related, Plank et al proposed differential compression to reduce checkpoint sizes for incremental checkpoints [32] and Tanzima et al show that similarities amongst checkpoint data from different processes can be exploited to compress and reduce checkpoint data volumes [33]. Finally, Sasaki et al propose a lossy compression method based on wavelet transform and vector quantization to the checkpoints of a production climate application [34], while Ni et al [35] study the trade-offs between the loss of precision, compression ratio, and application correctness due to lossy compression.…”
Section: Related Work
confidence: 99%