A Fault Tolerance Protocol with Fast Fault Recovery

Chakravorty, Sayantan; Kalé, Laxmikant V.

doi:10.1109/ipdps.2007.370310

Cited by 41 publications

(33 citation statements)

References 19 publications


“…If the system allows migratable tasks, then some of the tasks that live on the recovering node may be migrated to other nodes to be recovered in parallel. This scheme is an extension of message-logging and is called parallel recovery [6]. It has the potential to reduce recovery time to a fraction of the normal rework time.…”

Section: Parallel Recovery (mentioning)

confidence: 99%

“…The second strategy is a particular version of message-logging [5] that requires messages to be stored, but avoids a global rollback in case of a failure. Finally, the third approach is called parallel recovery [6] and requires the system to allow tasks to migrate after a failure. This ability potentially reduces recovery time to a small fraction of re-execution time from checkpoint.…”

Section: Introduction (mentioning)

confidence: 99%


Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems

Meneses, Sarood, Kalé (2012)

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing

Self Cite


Abstract: An exascale machine is expected to be delivered in the time frame 2018-2020. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of Nature and the universe. However, to make that a reality, the HPC community has to solve a few important challenges. Resilience will become a prominent problem because an exascale machine will experience frequent failures due to the large amount of components it will encompass. Some form of fault tolerance has to be incorporated in the system to maintain the progress rate of applications as high as possible. In parallel, the system will have to be more careful about power management. There are two dimensions of power. First, in a power-limited environment, all the layers of the system have to adhere to that limitation (including the fault tolerance layer). Second, power will be relevant due to energy consumption: an exascale installation will have to pay a large energy bill. It is fundamental to increase our understanding of the energy profile of different fault tolerance schemes. This paper presents an evaluation of three different fault tolerance approaches: checkpoint/restart, message-logging and parallel recovery. Using programs from different programming models, we show parallel recovery is the most energy-efficient solution for an execution with failures. At the same time, parallel recovery is able to finish the execution faster than the other approaches. We explore the behavior of these approaches at extreme scales using an analytical model. At large scale, parallel recovery is predicted to reduce the total execution time of an application by 17% and reduce the energy consumption by 13% when compared to checkpoint/restart.


“…The feasibility of proactive FT has been demonstrated at the job scheduling level [34] and in Adaptive MPI [8], [7], [9] using a combination of (a) object virtualization techniques to migrate tasks and (b) causal message logging [16] within the MPI runtime system of Charm++ applications. In contrast to Charm++, our solution is coarser grained as FT is provided at the process level, thereby encapsulating most of the process context, including open file descriptors, which are beyond the MPI runtime layer.…”

Section: Related Work (mentioning)

confidence: 99%

“…The feasibility of health monitoring at various levels has recently been demonstrated for temperature-aware monitoring, e.g., by using ACPI [1], and, more generically, by critical-event prediction [40]. Particularly in systems with thousands of processors, fault handling becomes imperative, yet approaches range from application-level and runtime-level to the level of OS schedulers [8], [7], [9], [34]. These and other approaches differ from our work in that we promote live migration combined with health monitoring.…”

Section: Introduction (mentioning)

confidence: 99%

Proactive process-level live migration and back migration in HPC environments

Wang, Mueller, Engelmann, et al. (2012)

Journal of Parallel and Distributed Computing


As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when one's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of processes migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.


“…Our resulting implementation can easily be combined with reactive checkpoint/restart frameworks to trigger restarts after components have failed [2, 5-10, 12-14, 17-20, 22, 23, 26, 27, 29, 32-36].…”

Section: Introduction (mentioning)

confidence: 99%