Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

Benoît, Anne; Cavelan, Aurélien; Robert, Yves; Sun, Hongyang

doi:10.1145/2897189

Cited by 22 publications

(26 citation statements)

References 46 publications

“…which is consistent with the results obtained in [2,6,7], provided that a reliable silent error detector is available. However, as mentioned previously, such a detector is only known in some application-specific domains.…”

Section: General Process Replication (supporting)

confidence: 92%

“…Then, for each of the simulated scenarios, we compare the simulated efficiency to the theoretical value, obtained using the model equations for S(P_opt). As pointed out in Section 6.1, process and group duplications lead to identical patterns, so we have merged the two scenarios and compared it against process and group triplications. The rest of this section presents the simulation results, most of which focus on coping with silent errors only, with the exception of Section 8.5 which considers both fail-stop and silent errors.…”

Section: Simulation Setup (mentioning)

confidence: 99%

Coping with silent and fail-stop errors at scale by combining replication and checkpointing

Benoît

Cavelan

Cappello

et al. 2018

Journal of Parallel and Distributed Computing

Self Cite

“…When the workflow consists of a linear chain of tasks, the problem of finding the optimal checkpoint strategy, i.e., determining which tasks to checkpoint, has been solved by Toueg and Babaoglu [34] using a dynamic programming algorithm. The algorithm of [34] was later extended in [8] to cope with both fail-stop and silent errors simultaneously. When the workflow is general but comprised of parallel tasks that each executes on the whole platform, the problem of placing checkpoints is NP-complete for simple join graphs [5] (this is because the original workflow is not a chain but must be linearized).…”

Section: Related Work (mentioning)

confidence: 99%

A Generic Approach to Scheduling and Checkpointing Workflows

Han

Fèvre

Canon

et al. 2018

Proceedings of the 47th International Conference on Parallel Processing

Self Cite

Abstract: This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but do not cause the entire memory of that processor to be lost, contrary to fail-stop errors. We revisit classical mapping heuristics such as HEFT and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is overkill when failures are rare events, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Contrary to previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as M-SPGs (Minimal Series-Parallel Graphs). Extensive experiments report significant gain over both CkptAll and CkptNone, for a wide variety of workflows.

“…Note that the tasks can themselves be parallel, but the execution flow is sequential, which dramatically limits the amount of re-execution in case of a failure. The algorithm of [16] was later extended in [47] to cope with both fail-stop and silent errors simultaneously.…”

Section: Fail-stop Failures (mentioning)

confidence: 99%

Checkpointing Workflows for Fail-Stop Errors

Han

Canon

Casanova

et al. 2017

2017 IEEE International Conference on Cluster Computing (CLUSTER)

Self Cite

Abstract: We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. For general DAGs this problem is hopelessly intractable. In fact, given a solution, computing its expected makespan is still a difficult problem. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (M-SPGs). It turns out that many real-world workflow applications are naturally structured as M-SPGs. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide which tasks in these sub-graphs should be checkpointed. Furthermore, it is possible to efficiently compute the expected makespan for the solution produced by this algorithm, using a first-order approximation of task weights and existing evaluation algorithms for 2-state probabilistic DAGs. We assess the performance of our algorithm for production workflow configurations, comparing it to (i) an approach in which all application data is checkpointed, which corresponds to the standard way in which most production workflows are executed today; and (ii) an approach in which no application data is checkpointed. Our results demonstrate that our algorithm strikes a good compromise between these two approaches, leading to lower checkpointing overhead than the former and to better resilience to failure than the latter. To the best of our knowledge, this is the first scheduling/checkpointing algorithm for workflow applications with fail-stop failures that considers workflow structures more general than mere linear chains of tasks.

Key-words: workflow, checkpoint, fail-stop error, resilience.
