IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012) 2012
DOI: 10.1109/dsn.2012.6263938

Algorithmic approaches to low overhead fault detection for sparse linear algebra

Abstract: The increasing size and complexity of High-Performance Computing systems is making it increasingly likely that individual circuits will produce erroneous results, especially when operated in a low energy mode. Previous techniques for Algorithm-Based Fault Tolerance (ABFT) [20] have been proposed for detecting errors in dense linear operations, but have high overhead in the context of sparse problems. In this paper, we propose a set of algorithmic techniques that minimize the overhead of fault detection for spa…
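To make the general idea behind ABFT-style detection concrete, the following is a minimal sketch of a classic checksum test applied to a sparse matrix-vector product y = A x: the sum of the entries of y should equal (1ᵀA) x, where 1ᵀA is the vector of column sums of A. This is an illustration of the textbook checksum identity only, not the specific low-overhead techniques proposed in the paper. The CSR layout, the function names (`csr_matvec`, `column_checksum`, `check_matvec`), and the tolerance values are all assumptions chosen for the example.

```python
import math
from typing import List, Sequence


def csr_matvec(values: Sequence[float], col_idx: Sequence[int],
               row_ptr: Sequence[int], x: Sequence[float]) -> List[float]:
    """Compute y = A x for a matrix A stored in CSR form."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def column_checksum(values: Sequence[float], col_idx: Sequence[int],
                    n_cols: int) -> List[float]:
    """Return c = 1^T A, the vector of column sums (computed once per matrix)."""
    c = [0.0] * n_cols
    for v, j in zip(values, col_idx):
        c[j] += v
    return c


def check_matvec(y: Sequence[float], c: Sequence[float], x: Sequence[float],
                 rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    """Return True if sum(y) matches c . x within a rounding tolerance.

    The tolerances are illustrative assumptions; too tight a value flags
    ordinary rounding as a fault, too loose a value hides small errors.
    """
    expected = sum(cj * xj for cj, xj in zip(c, x))
    observed = sum(y)
    return math.isclose(observed, expected, rel_tol=rel_tol, abs_tol=abs_tol)


# Hypothetical 3x3 example: A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

c = column_checksum(values, col_idx, n_cols=3)
y = csr_matvec(values, col_idx, row_ptr, x)
fault_free_ok = check_matvec(y, c, x)

# Simulate a fault by corrupting one output entry.
y_faulty = list(y)
y_faulty[1] += 1.0
fault_detected = not check_matvec(y_faulty, c, x)
```

In this sketch the check costs one dot product of length n plus a sum over y, and the column sums are reused across every product with the same matrix. A single scalar checksum cannot localize the faulty entry, and errors that cancel in the sum go unnoticed, which is one reason tolerance selection and check design matter for false positives and false negatives.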

citations
Cited by 74 publications
(47 citation statements)
references
References 17 publications
“…By decoupling, we assume a more relaxed model that allows incorrect results to be written. With algorithmic detection techniques [51], we have no natural instruction-precise "immediate" retry so we must identify a well defined point. Also, this decoupling allows our results to consider techniques like Truffle [16], which steers instructions into low-energy approximate or high-energy precise pipelines.…”
Section: F Discussion (mentioning)
confidence: 99%
“…Probabilistic modeling has been used by Chung et al. [7] to help compute the expected recovery time, which cannot be measured easily for very large scale programs. Sloan et al. [30] have discussed the use of algorithmic checks over sparse linear algebra kernels and focused mainly on reducing false positives and false negatives in error detection.…”
Section: Introduction (mentioning)
confidence: 99%
“…Algorithm-based fault tolerance (ABFT) [19] is a superior example of customized protection, because it offers high detection coverage and low runtime overhead in fundamental linear algebra operations and other matrix operations (e.g., LU decomposition [20], sparse matrix multiplication [21], and iterative linear solvers [22]). With ABFT, rewriting of an application program usually takes a large amount of manual effort, while our approach automatically customizes error detectors by itself using an unsupervised learning procedure once the protection target states (which are usually GPU kernel output data) have been identified by users.…”
Section: Introduction (mentioning)
confidence: 99%