2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) 2017
DOI: 10.1109/ccgrid.2017.40

Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications

Abstract: Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. There are studies that address fail-stop errors and studies that address SDCs. However, few studies address both types of errors together. In this paper we propose a software-based selective replication technique for HPC applications for both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime…

Cited by 30 publications (11 citation statements)
References 18 publications
“…It differs from this work in that they limit themselves to perfectly parallel applications while we investigate per task speedup profiles that obey Amdahl's law. More recently, Subasi et al. [33] proposed a software-based selective replication of task-parallel applications used for both fail-stop and silent errors. In contrast, this work (i) considers dependent tasks such as found in applications consisting of linear workflows; and (ii) proposes an optimal dynamic programming algorithm to solve the selective replication and checkpointing problem.…”
Section: Replication (mentioning)
confidence: 99%
“…Partial redundancy. Partial redundancy has been studied to decrease the overhead of complete redundancy [22, 46–48]. Adaptive partial redundancy has also been proposed wherein a subset of processes is dynamically selected for replication [30].…”
Section: Related Work (mentioning)
confidence: 99%
“…Girault and Kalla [45] propose an exponential-time algorithm for bi-criteria multiprocessor scheduling which returns a static schedule for the input DAG under upper bound constraints on the application execution time and on the global system failure rate. Subasi et al. [46] use partial replication to improve the reliability of an application in the presence of silent and fail-stop errors. Works that optimize reliability do not guarantee that all executions will eventually succeed (because, for instance, not all failure patterns are covered by the chosen replication scheme).…”
Section: Soft and Silent Errors (mentioning)
confidence: 99%