Incremental Checkpoint Schemes for Weibull Failure Distribution

Păun, Mihaela; Naksinehaboon, Nichamon; Nassar, Raja; Leangsuksun, Chokchai; Scott, Stephen L.; Taerat, Narate

doi:10.1142/s0129054110007283

Cited by 17 publications

(26 citation statements)

References 4 publications


“…Here, time between failures is the Weibull distribution. The failure rate increase or decrease with time, it may not [11], [12]. Several studies analyze the time [8], [9], [12], [13], [14], [15].…”

Section: Related Work

mentioning

confidence: 99%
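The point in the statement above, that a Weibull failure rate can rise or fall with time, can be illustrated with the standard Weibull hazard function h(t) = (k/λ)(t/λ)^(k−1). The following is a minimal sketch, not taken from any of the cited papers; the function name and the parameter values are illustrative assumptions.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull distribution at time t > 0.

    shape (k) and scale (lambda) are the usual Weibull parameters;
    the values used below are illustrative assumptions only.
    """
    return (shape / scale) * (t / scale) ** (shape - 1.0)


# shape < 1: failure rate decreases with time (early-life failures dominate)
# shape = 1: failure rate is constant (the exponential special case)
# shape > 1: failure rate increases with time (wear-out)
for shape in (0.5, 1.0, 2.0):
    early = weibull_hazard(1.0, shape, scale=2.0)
    late = weibull_hazard(10.0, shape, scale=2.0)
    print(f"shape={shape}: h(1)={early:.4f}, h(10)={late:.4f}")
```

The shape parameter alone decides the trend, which is why studies that fit Weibull models to failure logs usually report it first.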

“…We make use of the terms node, processor and resource interchangeably. The time between failures of nodes is taken as to follow the Weibull distribution [8], [9], [10], [11].…”

Section: Introduction

mentioning

confidence: 99%

“…variable of time-to-failure of uence T i | i = 1, 2,… can be ocess [11]. Therefore by mean we can derive a checkpoint ally optimizes the expected ure distribution [17].…”

mentioning

confidence: 99%

“…As a result, the fraction is a random variable which makes k a random variable between (0, 1) to be determined [11]. Variable k is used to determine the Recomputing time T_Re.…”

mentioning

confidence: 99%


Failure-aware scheduling in grid considering Weibull failure distribution

Singh, Garg (2013)

2013 International Conference on Recent Trends in Information Technology (ICRTIT)

In a Grid computing environment, resources are heterogeneous, geographically distributed, complex, and owned by different organizations, so they are more prone to failures. Application scheduling in such an environment is crucial. Generally, only the performance of resources is considered during application/job scheduling. But if a node with high computational power also has a high failure rate, there is little benefit in allocating tasks to it, because every failure requires recovery, which costs time. Thus, failures increase the make-span of the job and decrease system/node performance. A node with comparatively lower computational capacity and a lower failure rate may give better performance and a reduced make-span. It is therefore worthwhile to consider both the failure rate and the computational capacity of resources during scheduling. In this paper we propose an approach for scheduling tasks. We recalculate the computational capacity of resources by finding the expected time wasted due to failures, and the tasks are then scheduled according to this new computational capacity. The failure of nodes is assumed to follow the Weibull distribution.
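The idea in the abstract above, discounting a node's raw capacity by the time it is expected to lose to failures, can be sketched as follows. This is not the authors' formula, which the abstract does not give. It is a simplified stand-in that assumes each failure costs a fixed recovery time and uses the Weibull mean time between failures, λ·Γ(1 + 1/k). All names and numbers are hypothetical.

```python
import math


def weibull_mtbf(shape, scale):
    """Mean of a Weibull distribution: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)


def effective_capacity(capacity, shape, scale, recovery_time):
    """Capacity discounted by the long-run fraction of time spent recovering.

    Assumption (not from the paper): every failure costs a fixed
    recovery_time, so the useful fraction is MTBF / (MTBF + recovery_time).
    """
    mtbf = weibull_mtbf(shape, scale)
    return capacity * mtbf / (mtbf + recovery_time)


# Hypothetical nodes: time units are hours, capacity is in arbitrary units.
nodes = {
    "fast_flaky": dict(capacity=1000.0, shape=0.7, scale=2.0, recovery_time=1.0),
    "slow_stable": dict(capacity=800.0, shape=0.7, scale=50.0, recovery_time=1.0),
}

# Rank nodes by failure-adjusted capacity rather than raw capacity.
ranked = sorted(nodes, key=lambda n: effective_capacity(**nodes[n]), reverse=True)
for name in ranked:
    print(name, round(effective_capacity(**nodes[name]), 1))
```

Under these assumed parameters the slower but more reliable node ranks ahead of the faster one, which matches the abstract's argument that raw computational power alone is a poor scheduling criterion.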


“…However, the models in [9][10][11][12][13][14] assume that no failure event occurs during the rollback recovery phase, which is not a considerate representation for the characteristic of the rollback recovery execution. Besides, [15][16][17][18][19] also intended to determine the optimal checkpoint sequence under a certain circumstance in terms of the failure distribution.…”

Section: Introduction

mentioning

confidence: 99%

Checkpoint scheduling model for optimality

Xu, Men, Li et al. (2011)

Information Processing Letters

Optimizing checkpoint‐based fault‐tolerance in distributed stream processing systems: Theory to practice

2021


Fault-tolerance is an essential part of a stream processing system that guarantees data analysis can continue even after failures. State-of-the-art distributed stream processing systems use checkpointing to support fault-tolerance for stateful computations, where the state of the computations is periodically persisted. However, the frequency of performing checkpoints impacts the performance (utilization, latency, and throughput) of the system, as the checkpointing process consumes resources and time that could be used for actual computations. In practice, systems are often configured to perform checkpoints based on crude values, ignoring factors such as checkpoint and restart costs, leading to suboptimal performance. In our previous work, we proposed a theoretical optimal checkpoint interval that maximizes the system utilization for stream processing systems to minimize the impact of checkpointing on system performance. In this article, we investigate the practical benefits of our proposed theoretical optimal by conducting experiments in a real-world cloud setting using different streaming applications; we use Apache Flink, a well-known stream processing system, for our experiments. The experiment results demonstrate that an optimal interval can achieve better utilization, confirming the practicality of the theoretical model when applied to real-world applications. We observed utilization improvements from 10% to 200% for a range of failure rates from 0.3 failures per hour to 0.075 failures per minute. Moreover, we explore how the performance measures latency and throughput are affected by the optimal interval. Our observations demonstrate that significant improvements can be achieved using the optimal interval for both latency and throughput.
