Proceedings of the 21st Annual International Conference on Supercomputing 2007
DOI: 10.1145/1274971.1274978

Proactive fault tolerance for HPC with Xen virtualization

Abstract: Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where pro…

Citations: cited by 278 publications (157 citation statements)
References: 40 publications
“…We next provide the performance comparison of our approach to another solution at the OS virtualization layer in the context of proactive FT of MPI applications [33]. The common benchmarks measured with both solutions on the same hardware were NPB BT, CG, LU and SP.…”
Section: F Process-level Live Migration Vs Xen Virtualization Live
Citation type: mentioning
confidence: 99%
“…Furthermore, our approach provides a live migration mechanism, which supports continued execution of MPI applications during much of the migration time. This solution parallels live migration at the OS virtualization layer [10], which has been studied in the context of proactive FT of MPI applications [33], an approach that supports integrated health-based monitoring and proactive live migration over Xen guests. We contribute process-level live migration and demonstrate its superior efficiency to that of OS-level virtualization.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…net) is a software effort to allow full access to IPMI information. OpenIPMI has already been used in Type 1 feedback-loop control solutions [6,13]. It can be used with any type of feedback-loop control.…”
Section: System Monitoring
Citation type: mentioning
confidence: 99%
“…Recent efforts in proactive FT primarily targeted two aspects, failure prediction [4,9] and process- or virtual-machine-level migration [13,6]. Other initial work focused on a proactive FT framework [12], which combines both to perform prediction triggered migration.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…For instance [1] shows the interest of proactive fault tolerance for MPI applications, using Charm++ capabilities for the migration of MPI processes, migration and pause/unpause being two standard mechanisms for the implementation of proactive fault tolerance strategies. In [5] the authors show the benefit of using virtual machine (VM) migration for the implementation of proactive fault tolerance capabilities. However, these solutions directly implement a proactive fault tolerance policy into the system.…”
Section: Introduction
Citation type: mentioning
confidence: 99%