2016
DOI: 10.1145/2897189

Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

Abstract: In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to address both fail-stop and silent errors. The objective is to minimize either makespan or energy consumption. While DVFS is a popular approach for reducing the energy consumption, using lower speeds/voltages can increase the number of errors, thereby complicating the problem. We consider an application workflow whose dependence graph is a chain of tasks, and we study th…

Cited by 22 publications (26 citation statements)
References 46 publications
“…which is consistent with the results obtained in [2,6,7], provided that a reliable silent error detector is available. However, as mentioned previously, such a detector is only known in some application-specific domains.…”
Section: General Process Replication (supporting)
confidence: 92%
“…Then, for each of the simulated scenarios, we compare the simulated efficiency to the theoretical value, obtained using the model equations for S(P_opt). As pointed out in Section 6.1, process and group duplications lead to identical patterns, so we have merged the two scenarios and compared it against process and group triplications. The rest of this section presents the simulation results, most of which focus on coping with silent errors only, with the exception of Section 8.5 which considers both fail-stop and silent errors.…”
Section: Simulation Setup (mentioning)
confidence: 99%
“…When the workflow consists of a linear chain of tasks, the problem of finding the optimal checkpoint strategy, i.e., determining which tasks to checkpoint, has been solved by Toueg and Babaoglu [34] using a dynamic programming algorithm. The algorithm of [34] was later extended in [8] to cope with both fail-stop and silent errors simultaneously. When the workflow is general but comprised of parallel tasks that each executes on the whole platform, the problem of placing checkpoints is NP-complete for simple join graphs [5] (this is because the original workflow is not a chain but must be linearized).…”
Section: Related Work (mentioning)
confidence: 99%
“…Note that the tasks can themselves be parallel, but the execution flow is sequential, which dramatically limits the amount of re-execution in case of a failure. The algorithm of [16] was later extended in [47] to cope with both fail-stop and silent errors simultaneously.…”
Section: Fail-stop Failures (mentioning)
confidence: 99%