Proactive fault tolerance for HPC with Xen virtualization

Journal of Parallel and Distributed Computing

et al. 2012

Self Cite

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.

Section: F Process-level Live Migration Vs Xen Virtualization Live mentioning

confidence: 99%

Section: Related Work mentioning

confidence: 99%

Proactive process-level live migration and back migration in HPC environments

Wang

Mueller

Journal of Parallel and Distributed Computing

et al. 2012

Self Cite

“…net) is a software effort to allow full access to IPMI information. OpenIPMI has already been used in Type 1 feedback-loop control solutions [6,13]. It can be used with any type of feedback-loop control.…”

Section: System Monitoring mentioning

confidence: 99%

“…Recent efforts in proactive FT primarily targeted two aspects, failure prediction [4,9] and process- or virtual-machine-level migration [13,6]. Other initial work focused on a proactive FT framework [12], which combines both to perform prediction-triggered migration.…”

Section: Introduction mentioning

confidence: 99%

Proactive Fault Tolerance Using Preemptive Migration

2009 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing

Vallée

Naughton

et al. 2009

Self Cite

Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. This paper further relates prior work to the presented architecture and classification, and discusses the challenges ahead for needed supporting technologies.

“…For instance [1] shows the interest of proactive fault tolerance for MPI applications, using Charm++ capabilities for the migration of MPI processes, migration and pause/unpause being two standard mechanisms for the implementation of proactive fault tolerance strategies. In [5] the authors show the benefit of using virtual machine (VM) migration for the implementation of proactive fault tolerance capabilities. However, these solutions directly implement a proactive fault tolerance policy into the system.…”

Section: Introduction mentioning

confidence: 99%

A Framework for Proactive Fault Tolerance

Vallée

2008 Third International Conference on Availability, Reliability and Security

Tikotekar

et al. 2008

Self Cite

Fault tolerance is a major concern to guarantee availability of critical services as well as application execution. Traditional approaches for fault tolerance include checkpoint/restart or duplication. However, it is also possible to anticipate failures and proactively take action before failures occur in order to minimize failure impact on the system and application execution. This document presents a proactive fault tolerance framework. This framework can use different proactive fault tolerance mechanisms, i.e., migration and pause/unpause. The framework also allows the implementation of new proactive fault tolerance policies thanks to a modular architecture. A first proactive fault tolerance policy has been implemented, and preliminary experiments based on system-level virtualization have been carried out and compared with results obtained by simulation.