Proceedings of the 12th ACM International Conference on Computing Frontiers 2015
DOI: 10.1145/2742854.2742903

Programmer-directed partial redundancy for resilient HPC

Abstract: In this work we propose partial task replication and checkpointing for task-parallel HPC applications to mitigate silent data corruption (SDC) errors. Because complete replication of all application tasks can be prohibitive in resource cost, we introduce a programmer-directed selective replication mechanism that provides fault tolerance at lower cost. Results show that our scheme detects and corrects around 65% of SDC errors with only 4% overhead on average.

Peer reviewed. Postprint (published version).
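The idea of replicating only selected tasks can be illustrated with a minimal sketch. This is not the paper's implementation: the function `run_task`, the `replicate` flag, and the comparison-plus-re-execution recovery policy are all assumptions made for illustration, and the abstract does not specify how detection or correction is actually performed.

```python
import copy
from typing import Any, Callable


def run_task(task: Callable[[Any], Any], inputs: Any, replicate: bool) -> Any:
    """Run `task` on `inputs`; if `replicate` is set, guard it against SDC.

    Hypothetical policy (an assumption, not taken from the paper):
    checkpoint the inputs, execute the task twice, and compare results.
    On a mismatch, re-execute once more from the checkpoint and accept
    the value that two of the three runs agree on.
    """
    if not replicate:
        # Unprotected task: no overhead, but a corrupted result goes unnoticed.
        return task(inputs)

    checkpoint = copy.deepcopy(inputs)  # task-level checkpoint of the inputs
    first = task(copy.deepcopy(checkpoint))
    second = task(copy.deepcopy(checkpoint))
    if first == second:
        return first

    # The two replicas disagree: an error was detected. Re-run to correct it.
    third = task(copy.deepcopy(checkpoint))
    if third == first or third == second:
        return third
    raise RuntimeError("replicas disagree three ways; cannot correct the error")


def make_flaky_sum(fault_on_call: int) -> Callable[[list], int]:
    """Build a toy task that returns a wrong sum on one chosen call."""
    calls = {"n": 0}

    def task(xs: list) -> int:
        calls["n"] += 1
        total = sum(xs)
        if calls["n"] == fault_on_call:
            total += 1  # simulated silent corruption of the result
        return total

    return task


# The programmer marks only the tasks considered critical for replication.
protected = run_task(make_flaky_sum(fault_on_call=1), [1, 2, 3], replicate=True)
unprotected = run_task(make_flaky_sum(fault_on_call=1), [1, 2, 3], replicate=False)
```

In this toy run the protected call recovers the correct sum, while the unprotected call silently returns the corrupted value. That is the trade-off selective replication exposes: each replicated task costs at least one extra execution, so the programmer protects only a subset.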

Citations: cited by 21 publications (15 citation statements)
References: 6 publications
“…Partial redundancy is studied in [13,31,32] (in combination with coordinated checkpointing) to decrease the overhead of full replication. Adaptive redundancy is introduced in [19], where a subset of processes is dynamically selected for replication.…”
Section: Replication (mentioning)
confidence: 99%
“…Although partial replication has been empirically studied by some previous work [23,44,45], designing an optimal strategy that combines partial redundancy and checkpointing and analyzing its efficacy remain to be done.…”
Section: Results (mentioning)
confidence: 99%
“…), and a set of experimental and simulation results. Partial redundancy is studied in [23,44,45] (in combination with coordinated checkpointing) to decrease the overhead of full replication. Adaptive redundancy is introduced in [29], where a subset of processes is dynamically selected for replication.…”
Section: Replication For Fail-stop Errors (mentioning)
confidence: 99%
“…In this work we model HPC application reliability formally by Markov chains and we propose a dynamic runtime heuristic for utilizing idle resources to maximize the reliability of HPC applications. Our work [16] proposes a programmer-guided partial redundancy mechanism for SDCs and fail-stop errors.…”
Section: Related Work (mentioning)
confidence: 99%