2015 IEEE International Conference on Cluster Computing 2015
DOI: 10.1109/cluster.2015.36

DINO: Divergent Node Cloning for Sustained Redundancy in HPC

Abstract: A plethora of resilience techniques have been investigated, ranging from checkpoint/restart over redundancy to algorithm-based fault tolerance. Each technique works well for a different subset of application kernels and, depending on the kernel, has different overheads, resource requirements, and fault masking capabilities. If, however, such techniques are combined and they interact across kernels, new vulnerability windows are created. This work contributes the idea of end-to-end resilience by protecting window…

Cited by 3 publications
References 31 publications (36 reference statements)