2009 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (2009)
DOI: 10.1109/pdp.2009.31

Proactive Fault Tolerance Using Preemptive Migration

Abstract: Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. This paper further relates prior work to the presented architecture and classification, and discusses the challenges ahead for needed supporting technologies.

Citations: cited by 76 publications (41 citation statements)
References: 10 publications
“…Proactive fault tolerance [6] avoids experiencing failures through preventative measures, such as by migrating application parts away from compute nodes that are "about to fail". It relies on a feedback-loop control (Figure 1) with continuous health monitoring, data analysis, and application reallocation.…”
Section: System Monitoring
mentioning
confidence: 99%
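The feedback loop described in this statement (health monitoring, data analysis, application reallocation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Node` type, the single `temperature_c` metric, the fixed threshold test, and the least-loaded placement rule are hypothetical and are not taken from the cited paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical compute node with one monitored health metric."""
    name: str
    temperature_c: float                      # assumed metric, for illustration only
    tasks: list = field(default_factory=list)


def is_about_to_fail(node, threshold_c=85.0):
    """Data analysis step: a deliberately naive threshold test (assumption)."""
    return node.temperature_c >= threshold_c


def control_step(nodes, threshold_c=85.0):
    """One pass of the loop: read health data, analyze it, reallocate tasks.

    Returns the list of (task, source, destination) migrations performed.
    """
    healthy = [n for n in nodes if not is_about_to_fail(n, threshold_c)]
    migrations = []
    for node in nodes:
        if is_about_to_fail(node, threshold_c) and healthy:
            while node.tasks:
                task = node.tasks.pop()
                # Placement rule (assumption): pick the least-loaded healthy node.
                target = min(healthy, key=lambda n: len(n.tasks))
                target.tasks.append(task)
                migrations.append((task, node.name, target.name))
    return migrations
```

In a real system the analysis step would work on trends across many metrics rather than one threshold, and "migration" would mean moving a process or virtual machine rather than a list entry.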
“…We deployed the framework on the same 64-node cluster (in a 32-node degraded fashion due to faulty hardware) that was used for our earlier investigations (see Section 2.1 and [9,6]). For this test, we sampled 18 metrics on 32 nodes over a 4 hour period with constantly varying classes and a sample interval for all metrics of 30 seconds.…”
Section: Monitoring Data Accumulation
mentioning
confidence: 99%
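To give a sense of the data volume implied by the figures quoted in this statement (18 metrics, 32 nodes, 4 hours, 30-second interval), the short calculation below may help. It assumes every scheduled sample was actually collected and that the sample at the end of the window is not counted; the quoted text confirms neither.

```python
def expected_samples(nodes, metrics, duration_s, interval_s):
    """Return (samples per metric per node, total samples) for a sampling run.

    Assumes no dropped samples and a half-open time window (assumption).
    """
    per_series = duration_s // interval_s
    return per_series, per_series * nodes * metrics


per_series, total = expected_samples(nodes=32, metrics=18,
                                     duration_s=4 * 3600, interval_s=30)
print(per_series, total)  # 480 276480
```

Under these assumptions each metric yields 480 readings per node, or 276,480 readings overall.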
“…A number of advanced resilience technologies have been developed and/or are currently in development, including checkpoint/restart-specific file and storage systems, incremental/differential checkpointing, message logging for uncoordinated checkpointing, fault tolerant message passing interface (FT-MPI), containment domains, algorithm-based fault tolerance (ABFT), rejuvenation, reliability-aware scheduling, proactive migration, and redundancy [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. However, there are currently no tools, methods, and metrics to compare them fairly, especially at extreme scale, and to identify the cost/benefit trade-off.…”
Section: Introduction
mentioning
confidence: 99%
“…It is not desirable to have to restart a job from the beginning if it has been executing for hours or days or months [6]. A key challenge in maintaining the seamless (or near seamless) execution of such jobs in the event of failures is addressed under research in fault tolerance [7,8,9,10]. Many jobs rely on fault tolerant approaches that are implemented in the middleware supporting the job (for example [6,11,12,13]). The conventional fault tolerant mechanism supported by the middleware is checkpointing [14,15,16,17], which involves the periodic recording of intermediate states of execution of a job to which execution can be returned if a fault occurs.…”
mentioning
confidence: 99%
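The checkpointing mechanism this statement describes, periodically recording intermediate state so that execution can return to it after a fault, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the job (a running sum), the `fail_at` fault injection, and the use of a local pickle file are hypothetical and do not represent any middleware named in the citing paper.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, checkpoint_every, fail_at=None):
    """Toy job (assumption): sum the step indices, checkpointing periodically.

    fail_at is a hypothetical hook that simulates a fault before that step.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

If a run with `checkpoint_every=3` fails before step 7, the next call to `run_job` with the same path resumes from the checkpoint taken after step 6 and repeats only the work done since then, instead of starting the job over.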