Algorithmic approaches to low overhead fault detection for sparse linear algebra

Sloan, Joseph; Kumar, Rakesh; Bronevetsky, Greg

doi:10.1109/dsn.2012.6263938

Cited by 74 publications

(47 citation statements)

References 17 publications


“…By decoupling, we assume a more relaxed model that allows incorrect results to be written. With algorithmic detection techniques [51], we have no natural instruction-precise "immediate" retry so we must identify a well defined point. Also, this decoupling allows our results to consider techniques like Truffle [16], which steers instructions into low-energy approximate or high-energy precise pipelines.…”

Section: F Discussion (mentioning)

confidence: 99%

Mechanisms and Evaluation of Cross-Layer Fault-Tolerance for Supercomputing

Kruijf, Sankaralingam, et al. 2012

2012 41st International Conference on Parallel Processing


Abstract: Reliability is emerging as an important constraint for future microprocessors. Cooperative hardware and software approaches for error tolerance can solve this hardware reliability challenge. Cross-layer fault tolerance frameworks expose hardware failures to upper layers, like the compiler, to help correct faults. Such cooperative approaches require less hardware complexity than masking all faults at the hardware level and are generally more energy efficient. This paper provides a detailed design and an implementation study of cross-layer fault tolerance for supercomputing. Since supercomputers necessarily involve large component counts, they have more frequent failures than consumer electronics and small systems. Conventionally, these systems use redundancy and checkpointing to achieve reliable computing. However, redundancy increases acquisition as well as recurring energy costs. This paper describes a simple language-level mechanism coupled with complementary compilation and lightweight hardware error detection that provides efficient reliability and cross-layer fault tolerance for supercomputers. Our evaluation focuses on strong scaling problems for which we can trade computing power for redundancy. Our results show a range of 1.07x to 2.5x speedup when employing cross-layer error-tolerance compared to conventional full dual modular redundancy (DMR) to contain all errors within hardware. Further, we demonstrate the approach can sustain 7% to 50% lower energy. The most important result of this work is qualitative: we can use a simplified hardware design with relaxed architectural correctness guarantees.


“…Probabilistic modeling has been used by Chung et al [7] to help compute the expected recovery time, which cannot be measured easily for very large scale programs. Sloan et al [30] have discussed the use of algorithmic checks over sparse linear algebra kernels and focused mainly on reducing false positive and false negative in error detection.…”

Section: Introduction (mentioning)

confidence: 99%

A framework for evaluating comprehensive fault resilience mechanisms in numerical programs

Chen, Bronevetsky, et al. 2015

J Supercomput

Self Cite


As HPC systems approach Exascale, their circuit features will shrink while their overall size will grow, both at a fixed power limit. These trends imply that soft faults in electronic circuits will become an increasingly significant problem for programs that run on these systems, causing them to occasionally crash or worse, silently return incorrect results. This is motivating extensive work on program resilience to such faults, ranging from generic mechanisms such as replication or checkpoint/restart to algorithm-specific error detection and resilience mechanisms. Effective use of such mechanisms requires a detailed understanding of (1) which vulnerable parts of the program are most worth protecting and (2) the performance and resilience impact of fault resilience mechanisms on the program. This paper presents FaultTelescope, a tool that combines these two and generates actionable insights by presenting program vulnerabilities and impact of fault resilience mechanisms in an intuitive way.


“…Algorithm-based fault tolerance (ABFT) [19] is a superior example of customized protection, because it offers high detection coverage and low runtime overhead in fundamental linear algebra operations and other matrix operations (e.g., LU decomposition [20], sparse matrix multiplication [21], and iterative linear solvers [22]). With ABFT, rewriting of an application program usually takes a large amount of manual effort, while our approach automatically customizes error detectors by itself using an unsupervised learning procedure once the protection target states (which are usually GPU kernel output data) have been identified by users.…”

Section: Introduction (mentioning)

confidence: 99%

Characterization of Impact of Transient Faults and Detection of Data Corruption Errors in Large-Scale N-Body Programs Using Graphics Processing Units

Yim 2014

2014 IEEE 28th International Parallel and Distributed Processing Symposium


In N-body programs, trajectories of simulated particles have chaotic patterns if errors are in the initial conditions or occur during some computation steps. It was believed that the global properties (e.g., total energy) of simulated particles are unlikely to be affected by a small number of such errors. In this paper, we present a quantitative analysis of the impact of transient faults in GPU devices on a global property of simulated particles. We experimentally show that a single-bit error in non-control data can change the final total energy of a large-scale N-body program with ~2.1% probability. We also find that the corrupted total energy values have certain biases (e.g., the values are not a normal distribution), which can be used to reduce the expected number of re-executions. In this paper, we also present a data error detection technique for N-body programs by utilizing two types of properties that hold in simulated physical models. The presented technique and an existing redundancy-based technique together cover many data errors (e.g., >97.5%) with a small performance overhead (e.g., 2.3%).
