2007 IEEE International Conference on Cluster Computing 2007
DOI: 10.1109/clustr.2007.4629244

Evaluation of fault-tolerant policies using simulation

Abstract: Various mechanisms for fault tolerance (FT) are used today to reduce the impact of failures on application execution. In the case of system failure, the standard FT mechanisms are checkpoint/restart (for reactive FT) and migration (for proactive FT). However, each of these mechanisms creates an overhead on application execution, an overhead that becomes critical on large-scale systems, where previous studies have shown that applications may spend more time checkpointing state than perfor…

Cited by 25 publications (17 citation statements)
References 11 publications
“…Instead, we decided to compare the results of the implementation of a specific proactive fault tolerance policy with the results of the same policy obtained by simulation. Our simulator [8] is based on the system logs of LLNL's ASCI White system, and our experimentation platform consists of a 40-node cluster. Each physical node has a single Xen VM with 250MB of memory (the number of VMs is explicitly specified if VMs are stacked on physical machines); host OSes have 200MB of memory.…”
Section: Discussion (mentioning)
confidence: 99%
“…However, the architecture is designed to be easily extended to other mechanisms, such as process migration and process pause/unpause. This framework, coupled to our fault tolerance simulator [8] provides a complete set of tools for the study of proactive fault tolerance policies.…”
Section: Introduction (mentioning)
confidence: 99%
“…They also model the migration cost and introduce a dynamic scheduling mechanism accordingly [14]. In their paper, Tikotekar et al. also present a simulation framework that evaluates different FT mechanisms and policies, including a combination of reactive FT and proactive FT to decrease the number of checkpoints [48], which obtained the best results among all the real and simulated FT mechanisms and policies. These prior works, with their fault models, FT mechanisms for fault occurrences, and evaluation simulations, confirm that process migration is a suitable approach for proactive FT with lower cost than OS virtualization, which reinforces the significance of our solution.…”
Section: Related Work (mentioning)
confidence: 99%
“…In this spirit, our work focuses on fault-tolerant middleware for HPC systems. More specifically, this paper promotes process-level live migration combined with health monitoring for a proactive FT approach that complements existing C/R schemes with self-healing, and whose fault model is based on the work by Tikotekar et al. [48].…”
Section: Introduction (mentioning)
confidence: 99%
“…Recent work in this area utilized simulations to evaluate different trade-off models when combining preemptive migration with checkpoint/restart [11]. Using failure logs from Lawrence Livermore National Laboratory, the impact of failure prediction accuracy was evaluated and put in context with restart counts and checkpointing frequency.…”
Section: Proactive + Reactive Fault Tolerance (mentioning)
confidence: 99%