Model-based performance evaluation of distributed checkpointing protocols

Agbaria, Adnan; Friedman, Roy

doi:10.1016/j.peva.2007.09.001

Cited by 10 publications

(5 citation statements)

References 29 publications


“…Tuple spaces along with checkpointing and replication mechanisms have been applied to grid scheduling in [12]. The performance of distributed checkpointing protocols has been evaluated by Agbaria and Friedman [4]. They consider the overhead ratio which also takes the recovery time into account in performance evaluation.…”

Section: Related Work (mentioning)

confidence: 99%

Failure-aware resource management for high-availability computing clusters with distributed virtual machines

2010

Journal of Parallel and Distributed Computing


“…for the case $e_{t+1} = rec_{ij}(msg)$, $P_i \in I$, $P_j \in CP(S)$ we show that it cannot occur. So suppose there was such an event, then This implies $rec_{ij}(msg, 0)$ is in the history of $p_j^S$ (by (1) and (2)) and occurs before $mcp\_taken_i^S$ or $cp\_taken_i^S$ (by (1) and (3)). Hence, by Rule 2.1, $dep_j^S(i) = 1$.…”

Section: Results (mentioning)

confidence: 99%

“…A concise formal model can be the base of qualitative comparisons which would add to existing quantitative comparisons based on simulations, like [1,12]. We gave such a comparison with the blocking queue algorithm introduced in [13].…”

Section: Discussion (mentioning)

confidence: 99%

Analyzing Mutable Checkpointing via Invariants

Aggarwal

Kiehn

2015

Fundamentals of Software Engineering


The well-known coordinated snapshot algorithm of mutable checkpointing [7-9] is studied. We equip it with a concise formal model and analyze its operational behavior via an invariant characterizing the snapshot computation. By this we obtain a clear understanding of the intermediate behavior and a correctness proof of the final snapshot based on a strong notion of consistency (reachability within the partial order representing the underlying computation). The formal model further enables a comparison with the blocking queue algorithm [13] introduced for the same scenario and with the same objective. From a broader perspective, we advocate the use of formal semantics to formulate and prove correctness of distributed algorithms.


“…The conventional method for failure management and fault tolerance relies on checkpointing/restart mechanisms, which periodically save a snapshot of a system to a stable storage and use it to recover the system from failures reactively; see [10] for a comprehensive review and [23], [25], [3], [6] for examples. However, this method does not prevent systems from failures, and work loss is inevitable due to its rollback process [10].…”

Section: R Wmentioning

confidence: 99%

Failure prediction for autonomic management of networked computer systems with availability assurance

Zhang

2010

2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW)


Networked computer systems continue to grow in scale and in the complexity of their components and interactions. Component failures become norms instead of exceptions in these environments. Failure occurrence as well as its impact on system performance and operation costs are becoming an increasingly important concern to system designers and administrators. To achieve self-management of failures and resources in networked computer systems, we propose a framework for autonomic failure management with hierarchical failure prediction functionality for large coalition systems, such as coalition clusters and compute grids. It analyzes node, cluster and system wide failure behaviors and forecasts the prospective failure occurrences based on quantified failure dynamics. Failure correlations are inspected by the predictor. Experimental results in a computational grid on campus show the offline and online predictions by our predictors accurately forecast the failure trend and capture failure correlations in the production environment.

