2006
DOI: 10.1007/11945918_47

Proactive Fault Tolerance in MPI Applications Via Task Migration

Abstract: Failures are likely to be more frequent in systems with thousands of processors. Therefore, schemes for dealing with faults become increasingly important. In this paper, we present a fault tolerance solution for parallel applications that proactively migrates execution from processors where failure is imminent. Our approach assumes that some failures are predictable, and leverages the features in current hardware devices supporting early indication of faults. We use the concepts of processor virtuali…

Citations: Cited by 59 publications (43 citation statements)
References: 23 publications
“…The feasibility of proactive FT has been demonstrated at the job scheduling level [34] and in Adaptive MPI [8], [7], [9] using a combination of (a) object virtualization techniques to migrate tasks and (b) causal message logging [16] within the MPI runtime system of Charm++ applications. In contrast to Charm++, our solution is coarser grained as FT is provided at the process level, thereby encapsulating most of the process context, including open file descriptors, which are beyond the MPI runtime layer.…”
Section: Related Work (mentioning)
confidence: 99%
“…The feasibility of health monitoring at various levels has recently been demonstrated for temperature-aware monitoring, e.g., by using ACPI [1], and, more generically, by critical-event prediction [40]. Particularly in systems with thousands of processors, fault handling becomes imperative, yet approaches range from application-level and runtime-level to the level of OS schedulers [8], [7], [9], [34]. These and other approaches differ from our work in that we promote live migration combined with health monitoring.…”
Section: Introduction (mentioning)
confidence: 99%
“…Another past effort targeted transparent MPI task migration using the Charm++ middleware and its Adaptive MPI (AMPI) [2]. This work primarily focused on the migration aspect and did not provide the feedback-loop control needed for proactive FT.…”
Section: Transparent Migration Mechanisms (mentioning)
confidence: 99%
“…Our resulting implementation can easily be combined with reactive checkpoint/restart frameworks to trigger restarts after components have failed [2, 5-10, 12-14, 17-20, 22, 23, 26, 27, 29, 32-36].…”
Section: Introduction (mentioning)
confidence: 99%