Two-Level Incremental Checkpoint Recovery Scheme for Reducing System Total Overheads

Li, Huixian; Pang, Liaojun; Wang, Zhangquan

doi:10.1371/journal.pone.0104591

Cited by 10 publications

(1 citation statement)

References 35 publications

“…Among the variety of recovery techniques explored by modeling, simulation and experiments at scale, the de-facto one in HPC is checkpointing and rollback, in which the state of the parallel application is saved at successive time instant over the computing time. Being the single-level coordinated checkpoint scheme the most implemented one [7][8][9], other sophisticated versions of fault-tolerant protocols have been proposed during the last years, as it is the case of the multi-level (two-level and beyond) checkpointing [10][11][12][13][14] or the hierarchical approach among others [15][16][17][18][19][20].…”

Section: Introduction (mentioning)

confidence: 99%

On the modelling of optimal coordinated checkpoint period in supercomputers

2018

This work revises current assumptions adopted in the checkpointing modelling and evaluates their impact on the attained prediction of the optimal coordinated single-level checkpoint period. An accurate a priori assessment of the optimal checkpoint period for a given computing facility is necessary as it drives the incurred overhead due to frequent checkpointing and, as a result, implies a drop in the resource steady-state availability. The present study discusses the impact of the order of approximation used in the single-level coordinated checkpoint modelling and follows on extending previous results of the optimal checkpoint period to explore the effects of the checkpoint rate on the cluster performance under total execution time and energy consumption policies, and in terms of resource availability. A consequence of a prescribed checkpoint rate with current technology is a critical size of the cluster above which the attained availability is too poor to become a cost-effective platform. Thus, some guidelines for the cluster sizing are indicated.

A hybrid approach towards reduced checkpointing overhead in cloud-based applications

Sinha; Singh; Saini

2021

Peer-to-Peer Netw. Appl.

Chiron: Optimizing Fault Tolerance in QoS-aware Distributed Stream Processing Jobs

Geldenhuys; Thamsen; Kao

2020

2020 IEEE International Conference on Big Data (Big Data)

Fault tolerance is a property which needs deeper consideration when dealing with streaming jobs requiring high levels of availability and low-latency processing even in case of failures where Quality-of-Service constraints must be adhered to. Typically, systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing Checkpoint and Rollback Recovery. However, this is an expensive operation which impacts negatively on the overall performance of the system and manually optimizing fault tolerance for specific jobs is a difficult and time consuming task. In this paper we introduce Chiron, an approach for automatically optimizing the frequency with which checkpoints are performed in streaming jobs. For any chosen job, parallel profiling runs are performed, each containing a variant of the configurations, with the resulting metrics used to model the impact of checkpoint-based fault tolerance on performance and availability. Understanding these relationships is key to minimizing performance objectives and meeting strict Quality-of-Service constraints. We implemented Chiron prototypically together with Apache Flink and demonstrate its usefulness experimentally.
