Joseph Sloan scite author profile

Bronevetsky

2012

The increasing size and complexity of High-Performance Computing systems is making it increasingly likely that individual circuits will produce erroneous results, especially when operated in a low energy mode. Previous techniques for Algorithm-Based Fault Tolerance (ABFT) [20] have been proposed for detecting errors in dense linear operations, but have high overhead in the context of sparse problems. In this paper, we propose a set of algorithmic techniques that minimize the overhead of fault detection for sparse problems. The techniques are based on two insights. First, many sparse problems are well structured (e.g. diagonal, banded diagonal, block diagonal), which allows for sampling techniques to produce good approximations of the checks used for fault detection. These approximate checks may be acceptable for many sparse linear algebra applications. Second, many linear applications have enough reuse that preconditioning techniques can be used to make these applications more amenable to low-cost algorithmic checks. The proposed techniques are shown to yield up to 2x reductions in performance overhead over traditional ABFT checks for a spectrum of sparse problems. A case study using common linear solvers further illustrates the benefits of the proposed algorithmic techniques.

Index Terms: ABFT, sparse linear algebra, numerical methods, error detection

An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance

Bronevetsky

2013

A numerical optimization-based methodology for application robustification: Transforming applications for error tolerance

Kesler

et al. 2010

There have been several attempts at correcting process variation induced errors by identifying and masking these errors at the circuit and architecture level [10,27]. These approaches take up valuable die area and power on the chip. As an alternative, we explore the feasibility of an approach that allows these errors to occur freely, and handle them in software, at the algorithmic level. In this paper, we present a general approach to converting applications into an error tolerant form by recasting these applications as numerical optimization problems, which can then be solved reliably via stochastic optimization. We evaluate the potential robustness and energy benefits of the proposed approach using an FPGA-based framework that emulates timing errors in the floating point unit (FPU) of a Leon3 processor [11]. We show that stochastic versions of applications have the potential to produce good quality outputs in the face of timing errors under certain assumptions. We also show that good quality results are possible for both intrinsically robust algorithms as well as fragile applications under these assumptions.

On software design for stochastic processors

Sartori

2012

Much recent research [8,6,7] suggests significant power and energy benefits of relaxing correctness constraints in future processors. Such processors with relaxed constraints have often been referred to as stochastic processors [10,15,11]. In this paper we present three approaches for building applications for such processors. The first approach relies on relaxing the correctness of the application based upon an analysis of application characteristics. The second approach relies upon detecting and then correcting faults within the application as they arise. The third approach transforms applications into more error tolerant forms. In this paper, we show how these techniques that enhance or exploit the error tolerance of applications can yield significant power and energy benefits when computed on stochastic processors.

Towards scalable reliability frameworks for error prone CMPs

2009