2009
DOI: 10.1109/tpds.2008.93

Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids

Cited by 74 publications
(53 citation statements)
References 26 publications
“…Several authors have suggested techniques that identify tasks on the critical path, and then making scheduling decisions that attempt to ensure the timely execution of these tasks [32,33]. A widely used technique to cope with soft errors is task replication, the challenge being to avoid over-duplicating tasks so as to strike a good balance between fast failure-free executions and resilient executions [34]. Two representative practical frameworks are the NARBIT system [35], which recovers from soft errors via task replication and work stealing, and Nanos [36,37], a runtime system that supports the OpenMP programming model.…”
Section: Soft and Silent Errors (citation type: mentioning)
confidence: 99%
“…It computes a remapping interval during which it remaps those jobs that are assigned to a faulty resource and are inactive to some other resource in advance before it begins its execution. A minmax checkpoint placement method [1] is introduced that determines the suboptimal checkpoint sequence under uncertain circumstances in terms of the system failure time distribution. However, even if the (sub)optimal checkpointing interval is computed beforehand, the distributed system or application parameters upon which the interval is based will presumably change over time.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…The grid model (Figure 1) considered in this paper consists of [1]: geographically distributed computational sites with many computational resources (r) at each site. The latter include a user interface (UI) through which the jobs are submitted into the system; a Resource Broker (RB) which is used to identify all the available resources, a scheduler(S) to schedule the job to the available resources.…”
Section: Proposed Work (citation type: mentioning)
confidence: 99%