A Mechanism for Online Diagnosis of Hard Faults in Microprocessors

Bower, Fred A.; Sorin, Daniel J.; Ozev, Sule

doi:10.1109/micro.2005.8

Cited by 84 publications

(55 citation statements)

References 33 publications

“…An existing scheme that can be used for this purpose is the hard-fault detection and diagnosis framework described in [36], involving a low-cost hardware checker [38] and saturating counters. This section provides a brief outline of the methodology in [36] for the sake of completeness. The work here has no contributions towards this end.…”

Section: Architectural Pre-requisites (mentioning)

confidence: 99%

“…If an instruction result is found to be erroneous, the faulty FDU in use is recorded by incrementing a saturating counter corresponding to each and every FDU used by the instruction. If the fault-count for an FDU rises beyond a threshold within a pre-specified time interval, the fault in that unit is considered to be permanent [36]. Experimental results indicate that most hard faults can be suitably detected and diagnosed within a few thousand instructions after the faults develop.…”

Section: Online Detection and Diagnosis (mentioning)

confidence: 99%
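
The counter scheme described in the statement above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation from [36]: the class names, the unit names (`alu0`, `rob0`, ...), the threshold of 3, the 3-bit counter width, and the choice to model the "pre-specified time interval" as a periodic reset every `interval` retired instructions are all hypothetical.

```python
class SaturatingCounter:
    """Counter that stops at max_value instead of wrapping around."""

    def __init__(self, max_value):
        self.max_value = max_value
        self.value = 0

    def increment(self):
        self.value = min(self.value + 1, self.max_value)

    def reset(self):
        self.value = 0


class HardFaultDiagnoser:
    """Hypothetical model: one saturating counter per field-deconfigurable unit (FDU)."""

    def __init__(self, units, threshold=3, interval=1000, max_value=7):
        self.threshold = threshold      # assumed value, not from the source
        self.interval = interval        # assumed: counters cleared every `interval` instructions
        self.counters = {u: SaturatingCounter(max_value) for u in units}
        self.permanent = set()          # units diagnosed as having a hard fault
        self.retired = 0

    def retire(self, units_used, erroneous):
        """Record one checked instruction and the FDUs it passed through."""
        if erroneous:
            # Blame every unit the erroneous instruction used.
            for unit in units_used:
                counter = self.counters[unit]
                counter.increment()
                if counter.value > self.threshold:
                    self.permanent.add(unit)
        self.retired += 1
        # End of the observation interval: forget accumulated blame.
        if self.retired % self.interval == 0:
            for counter in self.counters.values():
                counter.reset()


# Usage: alu0 is assumed faulty; reorder-buffer entries rotate, so blame
# concentrates on alu0 while each rob entry stays below the threshold.
units = ["alu0", "alu1", "rob0", "rob1", "rob2", "rob3"]
diag = HardFaultDiagnoser(units)
for i in range(8):
    alu = "alu0" if i % 2 == 0 else "alu1"
    diag.retire([alu, f"rob{i % 4}"], erroneous=(alu == "alu0"))
print(sorted(diag.permanent))
```

Because an erroneous instruction increments the counter of every unit it touched, healthy units also accumulate blame. In this sketch the periodic reset and the rotation of instructions across units are what keep their counts under the threshold.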

A Hardware Framework for Yield and Reliability Enhancement in Chip Multiprocessors

Pan; Rodrigues; Kundu (2015)

ACM Trans. Embed. Comput. Syst.

Device reliability and manufacturability have emerged as dominant concerns in end-of-road CMOS devices. An increasing number of hardware failures are attributed to manufacturability or reliability problems. Maintaining an acceptable manufacturing yield for chips containing tens of billions of transistors with wide variations in device parameters has been identified as a great challenge. Additionally, today’s nanometer scale devices suffer from accelerated aging effects because of the extreme operating temperature and electric fields they are subjected to. Unless addressed in design, aging-related defects can significantly reduce the lifetime of a product. In this article, we investigate a micro-architectural scheme for improving yield and reliability of homogeneous chip multiprocessors (CMPs). The proposed solution involves a hardware framework that enables us to utilize the redundancies inherent in a multicore system to keep the system operational in the face of partial failures. A micro-architectural modification allows a faulty core in a CMP to use another core’s resources to service any instruction that the former cannot execute correctly by itself. This service improves yield and reliability but may cause loss of performance. The target platform for quantitative evaluation of performance under degradation is a dual-core and a quad-core chip multiprocessor with one or more cores sustaining partial failure. Simulation studies indicate that when a large, high-latency, and sparingly used unit such as a floating-point unit fails in a core, correct execution may be sustained through outsourcing with at most a 16% impact on performance for a floating-point intensive application. For applications with moderate floating-point load, the degradation is insignificant. The performance impact may be mitigated even further by judicious selection of the cores to commandeer depending on the current load on each of the candidate cores. 
The area overhead is also negligible due to resource reuse.

“…One proposal for run-time isolation is BlackJack [20], which exploits simultaneously-redundant threads on an SMT, previously used to detect soft errors, to detect defects. Bower et al in [3] propose using DIVA-checkers, small auxiliary cores that check committed instructions [1], for defect isolation. Constantinides et al in [6] propose a virtualization layer between the operating system and the hardware to introduce periodic special instructions for defect isolation.…”

Section: Defect Detection and Isolation (mentioning)

confidence: 99%

Architectural core salvaging in a multi-core processor for hard-error tolerance

Powell; Biswas; Gupta; et al. (2009)

Proceedings of the 36th Annual International Symposium on Computer Architecture

“…The final phase (Step 3) of the test routine uses the ACE get instruction to read and validate the test response from the scan state. If a test pattern fails to produce the correct response at the end of Step 3, the test program indicates which part of the hardware is defective and disables it through system reconfiguration [27,8]. If necessary, the test program can run additional test patterns to narrow down the defective part to a finer granularity.…”

Section: Ace-based Online Testing (mentioning)

confidence: 99%

Software-Based Online Detection of Hardware Defects Mechanisms, Architectural Support, and Evaluation

Constantinides; Mutlu; Austin; et al. (2007)

40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007)

As silicon process technology scales deeper into the nanometer regime, hardware defects are becoming more common. Such defects are bound to hinder the correct operation of future processor systems, unless new online techniques become available to detect and to tolerate them while preserving the integrity of software applications running on the system. This paper proposes a new, software-based, defect detection and diagnosis technique. We introduce a novel set of instructions, called Access-Control Extension (ACE), that can access and control the microprocessor's internal state. Special firmware periodically suspends microprocessor execution and uses the ACE instructions to run directed tests on the hardware. When a hardware defect is present, these tests can diagnose and locate it, and then activate system repair through resource reconfiguration. The software nature of our framework makes it flexible: testing techniques can be modified/upgraded in the field to trade off performance with reliability without requiring any change to the hardware. We evaluated our technique on a commercial chip-multiprocessor based on Sun's Niagara and found that it can provide very high coverage, with 99.22% of all silicon defects detected. Moreover, our results show that the average performance overhead of software-based testing is only 5.5%. Based on a detailed RTL-level implementation of our technique, we find its area overhead to be quite modest, with only a 5.8% increase in total chip area.
