2007 IEEE International Parallel and Distributed Processing Symposium
DOI: 10.1109/ipdps.2007.370310

A Fault Tolerance Protocol with Fast Fault Recovery

Abstract: Fault tolerance is an important issue for large machines with tens or hundreds of thousands of processors. Checkpoint-based methods, currently used on most machines, roll back all processors to previous checkpoints after a crash. This wastes a significant amount of computation as all processors have to redo all the computation from that checkpoint onwards. In addition, recovery time is bound by the time between the last checkpoint and the crash. Protocols based on message logging avoid the problem of rolling ba…

Cited by 41 publications (33 citation statements)
References 19 publications

“…If the system allows migratable tasks, then some of the tasks that live on the recovering node may be migrated to other nodes to be recovered in parallel. This scheme is an extension of message-logging and is called parallel recovery [6]. It has the potential to reduce recovery time to a fraction of the normal rework time.…”
Section: Parallel Recovery (mentioning)
confidence: 99%
“…The second strategy is a particular version of message-logging [5] that requires messages to be stored, but avoids a global rollback in case of a failure. Finally, the third approach is called parallel recovery [6] and requires the system to allow tasks to migrate after a failure. This ability potentially reduces recovery time to a small fraction of re-execution time from checkpoint.…”
Section: Introduction (mentioning)
confidence: 99%
“…The feasibility of proactive FT has been demonstrated at the job scheduling level [34] and in Adaptive MPI [8], [7], [9] using a combination of (a) object virtualization techniques to migrate tasks and (b) causal message logging [16] within the MPI runtime system of Charm++ applications. In contrast to Charm++, our solution is coarser grained as FT is provided at the process level, thereby encapsulating most of the process context, including open file descriptors, which are beyond the MPI runtime layer.…”
Section: Related Work (mentioning)
confidence: 99%
“…The feasibility of health monitoring at various levels has recently been demonstrated for temperature-aware monitoring, e.g., by using ACPI [1], and, more generically, by critical-event prediction [40]. Particularly in systems with thousands of processors, fault handling becomes imperative, yet approaches range from application-level and runtime-level to the level of OS schedulers [8], [7], [9], [34]. These and other approaches differ from our work in that we promote live migration combined with health monitoring.…”
Section: Introduction (mentioning)
confidence: 99%
“…Our resulting implementation can easily be combined with reactive checkpoint/restart frameworks to trigger restarts after components have failed [2, 5–10, 12–14, 17–20, 22, 23, 26, 27, 29, 32–36].…”
Section: Introduction (mentioning)
confidence: 99%