Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications

Subaşi, Ömer; Yalcin, Gulay; Zyulkyarov, Ferad; Ünsal, Osman; Labarta, Jesús

doi:10.1109/ccgrid.2017.40

Cited by 30 publications

(11 citation statements)

References 18 publications


“…It differs from this work in that they limit themselves to perfectly parallel applications while we investigate per task speedup profiles that obey Amdahl's law. More recently, Subasi et al [33] proposed a software-based selective replication of task-parallel applications used for both fail-stop and silent errors. In contrast, this work (i) considers dependent tasks such as found in applications consisting of linear workflows; and (ii) proposes an optimal dynamic programming algorithm to solve the selective replication and checkpointing problem.…”

Section: Replication (mentioning)

confidence: 99%

Combining Checkpointing and Replication for Reliable Execution of Linear Workflows

Benoît, Cavelan, Ciorba, et al. 2018

2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)


This report combines checkpointing and replication for the reliable execution of linear workflows. While both methods have been studied separately, their combination has not yet been investigated despite its promising potential to minimize the execution time of linear workflows in failure-prone environments. The combination raises new problems: for each task, we have to decide whether to checkpoint and/or replicate it. We provide an optimal dynamic programming algorithm of quadratic complexity to solve both problems. This dynamic programming algorithm has been validated through extensive simulations that reveal the conditions in which checkpointing only, replication only, or the combination of both techniques leads to improved performance.
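To make the idea of a quadratic dynamic program over a linear workflow concrete, here is a minimal Python sketch. It is not the algorithm from the cited paper: it handles checkpoint placement only (no replication decision), and it assumes exponentially distributed fail-stop failures with rate `lam`, a uniform checkpoint cost `ckpt`, and a uniform recovery cost `recovery`. The function names and the closed-form segment cost are illustrative assumptions, not taken from the abstract.

```python
import math


def expected_segment_time(work, ckpt, recovery, lam, downtime=0.0):
    """Expected time to run `work` and then checkpoint, under exponential failures.

    Assumed cost model (one commonly used closed form):
        E = exp(lam * R) * (1 / lam + D) * (exp(lam * (W + C)) - 1)
    """
    return math.exp(lam * recovery) * (1.0 / lam + downtime) * math.expm1(lam * (work + ckpt))


def place_checkpoints(weights, ckpt, recovery, lam):
    """Choose after which tasks of a linear chain to checkpoint.

    best[j] is the minimal expected time to finish tasks 1..j with a
    checkpoint taken right after task j. Two nested loops give O(n^2).
    Returns (expected_makespan, sorted list of checkpointed task indices).
    """
    n = len(weights)
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):  # last checkpoint before j was taken after task i
            segment = expected_segment_time(prefix[j] - prefix[i], ckpt, recovery, lam)
            if best[i] + segment < best[j]:
                best[j] = best[i] + segment
                prev[j] = i

    # Walk the predecessor links back from the final (mandatory) checkpoint.
    cuts = []
    j = n
    while j > 0:
        cuts.append(j)
        j = prev[j]
    return best[n], sorted(cuts)


if __name__ == "__main__":
    makespan, cuts = place_checkpoints([10.0, 10.0, 10.0], ckpt=1.0, recovery=1.0, lam=0.01)
    print(makespan, cuts)
```

In this sketch a free checkpoint is taken after every task, while a very expensive checkpoint is taken only once, at the end of the chain. Extending the state to also record a per-task replication choice would be the natural next step toward the combined problem the abstract describes.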


“…Partial redundancy. Partial redundancy has been studied to decrease the overhead of complete redundancy [22, 46, 47, 48]. Adaptive partial redundancy has also been proposed wherein a subset of processes is dynamically selected for replication [30].…”

Section: Related Work (mentioning)

confidence: 99%

Detection of Silent Data Corruptions in Smoothed Particle Hydrodynamics Simulations

Cavelan, Cabezón, Ciorba 2019

2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)


Silent data corruptions (SDCs) hinder the correctness of long-running scientific applications on large scale computing systems. Selective particle replication (SPR) is proposed herein as the first particle-based replication method for detecting SDCs in smoothed particle hydrodynamics (SPH) simulations. SPH is a mesh-free Lagrangian method commonly used to perform hydrodynamical simulations in astrophysics and computational fluid dynamics. SPH performs interpolation of physical properties over neighboring discretization points (called SPH particles) that dynamically adapt their distribution to the mass density field of the fluid. When a fault (e.g., a bit-flip) strikes the computation or the data associated with a particle, the resulting error is silently propagated to all nearest neighbors through such interpolation steps. SPR replicates the computation and data of a few carefully selected SPH particles. SDCs are detected when the data of a particle differs, due to corruption, from its replicated counterpart. SPR is able to detect many DRAM SDCs as they propagate by ensuring that all particles have at least one neighbor that is replicated. The detection capabilities of SPR were assessed through a set of error-injection and detection experiments and the overhead of SPR was evaluated via a set of strong-scaling experiments conducted on an HPC system. The results show that SPR achieves detection rates of 91-99.9%, with no false positives, at an overhead of 1-10%.
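The two ingredients named in the abstract, choosing a small set of particles so that every particle has a replicated neighbor and flagging a corruption when a particle and its replica disagree, can be sketched as follows. This is an illustrative Python sketch only: the greedy selection heuristic, the function names `select_replicas` and `detect_sdc`, and the dictionary-based data layout are assumptions, since the abstract does not say how SPR actually selects particles.

```python
def select_replicas(neighbors):
    """Greedily pick particles to replicate.

    `neighbors` maps each particle id to the set of its neighbor ids
    (assumed symmetric). The goal is that every particle ends up with at
    least one replicated neighbor. Greedy choice is an assumption here.
    """
    uncovered = set(neighbors)  # particles with no replicated neighbor yet
    selected = set()
    while uncovered:
        # Replicating p covers all still-uncovered neighbors of p.
        # Ties are broken toward the smallest particle id.
        best = max(sorted(neighbors), key=lambda p: len(neighbors[p] & uncovered))
        gain = neighbors[best] & uncovered
        if not gain:
            break  # remaining particles have no neighbors at all
        selected.add(best)
        uncovered -= gain
    return selected


def detect_sdc(primary, replica, tol=0.0):
    """Return ids of replicated particles whose primary and replica values differ."""
    return [p for p, value in sorted(replica.items()) if abs(primary[p] - value) > tol]


if __name__ == "__main__":
    chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
    print(select_replicas(chain))
    print(detect_sdc({1: 1.0, 2: 2.0}, {1: 1.0, 2: 2.5}))
```

On a four-particle chain the sketch replicates the two interior particles, which gives every particle a replicated neighbor. A nonzero `tol` would be needed wherever the primary and replica computations are not bitwise reproducible.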


“…Girault and Kalla [45] propose an exponential-time algorithm for bi-criteria multiprocessor scheduling which returns a static schedule for the input DAG under upper bound constraints on the application execution time and on the global system failure rate. Subasi et al [46] use partial replication to improve the reliability of an application in the presence of silent and fail-stop errors. Works that optimize reliability do not guarantee that all executions will eventually succeed (because, for instance, not all failure patterns are covered by the chosen replication scheme).…”

Section: Soft and Silent Errors (mentioning)

confidence: 99%

Checkpointing Workflows for Fail-Stop Errors

Han, Canon, Casanova, et al. 2018

IEEE Trans. Comput.


We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (M-SPGs), which is relevant to many real-world workflow applications. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide how to checkpoint these sub-graphs. We assess the performance of our algorithm for production workflow configurations, comparing it to an approach in which all application data is checkpointed and an approach in which no application data is checkpointed. Results demonstrate that our algorithm outperforms both the former approach, because of lower checkpointing overhead, and the latter approach, because of better resilience to failures.
