Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis 2013
DOI: 10.1145/2503210.2503226
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach

Abstract: Algorithm-based fault tolerance (ABFT) is a highly efficient resilience solution for many widely used scientific computing kernels. However, in the context of the resilience ecosystem, ABFT is completely opaque to any underlying hardware resilience mechanisms. As a result, some data structures are over-protected by ABFT and hardware, which leads to redundant costs in terms of performance and energy. In this paper, we rethink ABFT using an integrated view including both software and hardware with the goal of im…

Citations: cited by 36 publications (18 citation statements)
References: 40 publications (46 reference statements)

“…We enrich the FRAM model with a safe memory of arbitrary size S and then give evidence that an increased safe memory can be exploited to notably improve the performance of resilient algorithms. In addition to its theoretical interest, the adoption of such a model is supported by recent research on hybrid systems that integrate algorithmic resiliency with the (limited) amount of memory protected by hardware ECC [17]. In this setting, S would denote the memory that is protected by the hardware.…”
Section: Our Results (citation type: mentioning)
confidence: 99%
“…With costs between 40%-85% vs 200%, our results may indicate an opportunity to disable ECC for DRAM and rather run ECC-unprotected when FlipSphere provides protection for kernels (not just for ECC-detectable events but also extra protection against silent data corruption), particularly since turning off ECC may result in lower memory latency and power consumption; yet, in contrast to Li et al. [23], FlipSphere does not require algorithmic changes.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…The design of solutions that combine capabilities across different layers of the system stack has also been previously explored, but using ad-hoc methods. For example, using the ABFT technique to protect application data structures permits different ECC mechanisms for different page frames in memory [13]. To deal with fail-stop and silent errors simultaneously, recent work has proposed combining ABFT methods with system-based checkpointing [2], in which each computational phase is followed by ABFT verification for SDCs and an in-memory checkpoint.…”
Section: Related Work (citation type: mentioning)
confidence: 99%