The increasing size and complexity of HighPerformance Computing systems is making it increasingly likely that individual circuits will produce erroneous results, especially when operated in a low energy mode. Previous techniques for Algorithm -Based Fault Tolerance (ABFT) [20] have been proposed for detecting errors in dense linear operations, but have high overhead in the context of sparse problems. In this paper, we propose a set of algorithmic techniques that minimize the overhead of fault detection for sparse problems. The techniques are based on two insights. First, many sparse problems are well structured (e.g. diagonal, banded diagonal, block diagonal), which allows for sampling techniques to produce good approximations of the checks used for fault detection. These approximate checks may be acceptable for many sparse linear algebra applications. Second, many linear applications have enough reuse that preconditioning techniques can be used to make these applications more amenable to low-cost algorithmic checks. The proposed techniques are shown to yield up to 2x reductions in performance overhead over traditional ABFT checks for a spectrum of sparse problems. A case study using common linear solvers further illustrates the benefits of the proposed algorithmic techniques.Index Terms-ABFT, sparse linear algebra, numerical methods, error detection
There have been several attempts at correcting process vari ation induced errors by identifying and masking these er rors at the circuit and architecture level [10,271. These ap proaches take up valuable die area and power on the chip. As an alternative, we explore the feasibility of an approach that allows these errors to occur freely, and handle them in software, at the algorithmic level. In this paper, we present a general approach to converting applications into an error tolerant form by recasting these applications as numerical optimization problems, which can then be solved reliably via stochastic optimization. We evaluate the potential ro bustness and energy benefits of the proposed approach us ing an FPGA-based framework that emulates timing errors in the floating point unit (FPU) of a Leon3 processor [111. We show that stochastic versions of applications have the potential to produce good quality outputs in the face of tim ing errors under certain assumptions. We also show that good quality results are possible for both intrinsically ro bust algorithms as well as fragile applications under these assumptions.
Much recent research [8,6,7] suggests significant power and energy benefits of relaxing correctness constraints in future processors. Such processors with relaxed constraints have often been referred to as stochastic processors [10,15,11]. In this paper we present three approaches for building applications for such processors. The first approach relies on relaxing the correctness of the application based upon an analysis of application characteristics. The second approach relies upon detecting and then correcting faults within the application as they arise. The third approach transforms applications into more error tolerant forms. In this paper, we show how these techniques that enhance or exploit the error tolerance of applications can yield significant power and energy benefits when computed on stochastic processors.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.