Proactive Fault Tolerance in MPI Applications Via Task Migration

Chakravorty, Sayantan; Mendes, Celso L.; Kalé, Laxmikant V.

doi:10.1007/11945918_47

Cited by 59 publications

(43 citation statements)

References 23 publications

“…The feasibility of proactive FT has been demonstrated at the job scheduling level [34] and in Adaptive MPI [8], [7], [9] using a combination of (a) object virtualization techniques to migrate tasks and (b) causal message logging [16] within the MPI runtime system of Charm++ applications. In contrast to Charm++, our solution is coarser grained as FT is provided at the process level, thereby encapsulating most of the process context, including open file descriptors, which are beyond the MPI runtime layer.…”

Section: Related Work (mentioning)

confidence: 99%

“…The feasibility of health monitoring at various levels has recently been demonstrated for temperature-aware monitoring, e.g., by using ACPI [1], and, more generically, by critical-event prediction [40]. Particularly in systems with thousands of processors, fault handling becomes imperative, yet approaches range from application-level and runtime-level to the level of OS schedulers [8], [7], [9], [34]. These and other approaches differ from our work in that we promote live migration combined with health monitoring.…”

Section: Introduction (mentioning)

confidence: 99%

Proactive process-level live migration and back migration in HPC environments

Wang, Mueller, Engelmann, et al. 2012

Journal of Parallel and Distributed Computing

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when one's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
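
The "nearly half" figure in the abstract can be made plausible with a short calculation. The sketch below assumes Young's first-order approximation for the optimal checkpoint interval, sqrt(2 × checkpoint cost × mean time between failures); the abstract does not say which model the authors used, so this is an illustrative assumption, and the function names are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint interval (assumed model)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_reduction(proactive_fraction: float) -> float:
    """Fraction of checkpoints saved when `proactive_fraction` of faults
    are avoided by migration and never need reactive recovery.

    Only (1 - p) of the faults remain, so the effective MTBF grows by
    1 / (1 - p) and the interval grows by the square root of that factor.
    """
    interval_scale = math.sqrt(1.0 / (1.0 - proactive_fraction))
    return 1.0 - 1.0 / interval_scale


if __name__ == "__main__":
    saved = checkpoint_reduction(0.7)
    print(f"checkpoints saved at 70% proactive handling: {saved:.1%}")
```

Under this assumed model, handling 70% of faults proactively lengthens the checkpoint interval by a factor of about 1.83, which removes roughly 45% of the checkpoints, in line with the abstract's "nearly cutting the number of checkpoints in half".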

“…Another past effort targeted transparent MPI task migration using the Charm++ middleware and its Adaptive MPI (AMPI) [2]. This work primarily focused on the migration aspect and did not provide the feedback-loop control needed for proactive FT.…”

Section: Transparent Migration Mechanisms (mentioning)

confidence: 99%

Proactive Fault Tolerance Using Preemptive Migration

Engelmann, Vallée, Naughton, et al. 2009

2009 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing

Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. This paper further relates prior work to the presented architecture and classification, and discusses the challenges ahead for needed supporting technologies.

“…Our resulting implementation can easily be combined with reactive checkpoint/restart frameworks to trigger restarts after components have failed [2, 5-10, 12-14, 17-20, 22, 23, 26, 27, 29, 32-36].…”

Section: Introduction (mentioning)

confidence: 99%