Designing a processor from the ground up to allow voltage/reliability tradeoffs

Kahng, Andrew B.; Kang, Seokhyeong; Kumar, Rakesh; Sartori, John

doi:10.1109/hpca.2010.5416652

Cited by 74 publications (57 citation statements); references 18 publications

“…For example, conventional processors are optimized such that all the timing paths are critical or near-critical ("timing slack wall"). This means that any time an attempt is made to reduce power by trading off reliability (by reducing voltage, for example), a catastrophically large number of timing errors is seen [20]. The slack distribution can be manipulated (for example, to make it look gradual, instead of looking like a wall (Fig.…”

Section: Implications for Circuits and Architectures (mentioning)
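
The slack-wall effect described in this excerpt can be reproduced with a toy model. The sketch below is not taken from the cited papers. It assumes path delays normalized to the clock period, a voltage reduction that stretches every path delay by the same factor, and the fraction of failing paths as a stand-in for the error rate. The names `make_designs` and `failing_fraction` are hypothetical.

```python
"""Toy model of a timing "slack wall" versus a gradual slack distribution.

Assumptions (hypothetical, for illustration only):
- Path delays are normalized to the clock period (1.0 means exactly critical).
- Lowering the supply voltage multiplies every path delay by the same factor.
- The fraction of failing paths stands in for the timing error rate.
"""
import random


def failing_fraction(delays, delay_scale):
    """Fraction of paths whose scaled delay exceeds the clock period (1.0)."""
    return sum(1 for d in delays if d * delay_scale > 1.0) / len(delays)


def make_designs(n=10_000, seed=0):
    """Return (wall, gradual) lists of normalized path delays."""
    rng = random.Random(seed)
    # Slack wall: almost every path sits within 5% of the clock period.
    wall = [rng.uniform(0.95, 1.0) for _ in range(n)]
    # Gradual slack: path delays spread evenly between 50% and 100%.
    gradual = [rng.uniform(0.5, 1.0) for _ in range(n)]
    return wall, gradual


if __name__ == "__main__":
    wall, gradual = make_designs()
    for scale in (1.00, 1.02, 1.05, 1.10, 1.20):
        print(f"delay x{scale:.2f}: "
              f"wall={failing_fraction(wall, scale):.3f}  "
              f"gradual={failing_fraction(gradual, scale):.3f}")
```

With these assumed distributions, a 5% increase in path delay pushes roughly 95% of the wall design's paths past the clock period, but only about 10% of the gradual design's paths.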

“…Error probability can be traded off with design metrics by adjusting verification thresholds, or selective guardbanding, or selective redundancy, etc. Recent work ("stochastic computing" [79]) advocates that hardware should be allowed to produce errors even during nominal operation if such […] [20] design is to transform a slack distribution characterized by a critical wall into one with a more gradual failure characteristic. This allows performance/power tradeoffs over a range of error rates, whereas conventional designs are optimized for correct operation and recovery-driven designs are optimized for a specific target error rate.…”

Section: Implications for Circuits and Architectures (mentioning)
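
The contrast between a gradual slack design and a design optimized for correct operation can be illustrated by asking, for several target error rates, how far the supply voltage can be lowered. This sketch assumes an alpha-power-law delay model with made-up parameters (`V_NOM`, `V_TH`, `ALPHA`), dynamic power proportional to V² at a fixed frequency, and the failing-path fraction as the error-rate proxy. None of these come from the cited work, and `lowest_voltage` is a hypothetical helper.

```python
"""Sketch: picking the lowest supply voltage that meets a target error rate.

Hypothetical model (not from the cited work):
- Alpha-power-law gate delay, delay ~ V / (V - V_TH)**ALPHA, normalized to V_NOM.
- Dynamic power ~ V**2 at a fixed clock frequency.
- Error rate approximated by the fraction of paths that miss the clock period.
"""
import random

V_NOM, V_TH, ALPHA = 1.0, 0.3, 1.3  # assumed technology parameters


def delay_scale(v):
    """Path-delay multiplier at supply v relative to the nominal supply."""
    return (v / (v - V_TH) ** ALPHA) / (V_NOM / (V_NOM - V_TH) ** ALPHA)


def error_rate(delays, v):
    """Fraction of normalized path delays that exceed 1.0 at supply v."""
    scale = delay_scale(v)
    return sum(1 for d in delays if d * scale > 1.0) / len(delays)


def lowest_voltage(delays, target, step=0.005):
    """Scan down from V_NOM and return the lowest voltage meeting the target.

    Delay grows monotonically as v drops in this model, so the scan can stop
    at the first voltage whose error rate exceeds the target.
    """
    best = v = V_NOM
    # Stay well above V_TH, where the simple delay model breaks down.
    while v - step > V_TH + 0.1:
        v -= step
        if error_rate(delays, v) > target:
            break
        best = v
    return best


if __name__ == "__main__":
    rng = random.Random(1)
    wall = [rng.uniform(0.95, 1.0) for _ in range(10_000)]
    gradual = [rng.uniform(0.5, 1.0) for _ in range(10_000)]
    for target in (0.0, 0.01, 0.05, 0.10):
        for name, delays in (("wall", wall), ("gradual", gradual)):
            v = lowest_voltage(delays, target)
            print(f"target={target:.2f} {name:8s} V={v:.3f} "
                  f"relative dynamic power={(v / V_NOM) ** 2:.3f}")
```

Under these assumptions, a 10% target lets the gradual design drop to around 0.945 V, while the wall design can only go to about 0.995 V; lower targets move both back toward the nominal supply.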

“…2) Type of Operation. Erroneous operation may rely upon the application's level of tolerance to limited errors (as in [18]-[20]) to ensure continued operation. In contrast, error-free UnO machines correct all errors (e.g., [13]) or operate hardware within correct-operation limits (e.g., [6], [21]).…”

Section: UnO Computing Machines (mentioning)

Underdesigned and Opportunistic Computing in Presence of Hardware Variability

Gupta, Agarwal, Dolecek, et al. 2013

IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.

Self Cite

Abstract: Microelectronic circuits exhibit increasing variations in performance, power consumption, and reliability parameters across manufactured parts and across the use of these parts in the field over time. These variations have led to increasing use of overdesign and guardbands in design and test to ensure yield and reliability with respect to a rigid set of datasheet specifications. This paper explores the possibility of constructing computing machines that purposely expose hardware variations to various layers of the system stack, including software. This leads to the vision of underdesigned hardware that utilizes a software stack that opportunistically adapts to sensed or modeled hardware. The envisioned underdesigned and opportunistic computing (UnO) machines face a number of challenges related to the sensing infrastructure and to software interfaces that can effectively utilize the sensory data. In this paper, we outline specific sensing mechanisms that we have developed and their potential use in building UnO machines.

“…Inaccurate estimation can lead to over- or under-optimization. Figure 6.1 compares the error rate estimation approach proposed in Section 3.4 of this work against the error rate computed during functional simulation and against an estimator used by the slack optimization heuristic in [19], [17]. …”

Section: Evaluation of Error Rate Estimation (mentioning)
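
Error rate estimation of the kind compared in this excerpt depends on which paths are actually exercised. A minimal, activity-aware sketch is shown here. It assumes a hypothetical trace listing the sensitized path IDs for each simulated cycle and per-path delays normalized to the clock period, and it counts a cycle as erroneous when any of its sensitized paths is too slow. It illustrates the general idea only; it is not the estimator from Section 3.4 or from [19], [17], and `estimate_error_rate` is a hypothetical name.

```python
"""Sketch: activity-aware timing error rate estimate from a simulation trace.

Hypothetical inputs (names are illustrative, not from the cited work):
- path_delay: maps a path id to its delay at the chosen voltage,
  normalized to the clock period.
- trace: for each simulated cycle, the set of path ids sensitized in that cycle.
A cycle counts as erroneous if any sensitized path misses the clock period.
"""
from typing import Dict, Iterable, Set


def estimate_error_rate(path_delay: Dict[str, float],
                        trace: Iterable[Set[str]],
                        clock_period: float = 1.0) -> float:
    # Paths that would violate timing whenever they are exercised.
    violating = {p for p, d in path_delay.items() if d > clock_period}
    cycles = 0
    errors = 0
    for sensitized in trace:
        cycles += 1
        if sensitized & violating:  # at least one exercised path is too slow
            errors += 1
    return errors / cycles if cycles else 0.0


if __name__ == "__main__":
    path_delay = {"a": 0.8, "b": 1.05, "c": 1.2}
    trace = [{"a"}, {"a", "b"}, {"c"}, {"a"}]
    print(estimate_error_rate(path_delay, trace))  # 0.5
```

In this example two of the four cycles exercise a slow path, giving 0.5, whereas counting paths alone would report two of three paths failing; the gap between the two views is why activity information matters for estimation accuracy.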

“…To demonstrate the benefits of our recovery-driven design flow, we compare five alternative design flows: traditional P&R implementations with conventional and tight timing constraints, a BlueShift-like path constraint tuning (PCT) approach, gradual slack design [19], [17], and our heuristic for error-rate-optimized recovery-driven design. Figure 6.4 compares the power consumption of the various design techniques at several target error rates.…”

Section: Comparison Against Alternative Flows (mentioning)

Recovery-Driven Design: Exploiting Error Resilience in Design of Energy-Efficient Processors

Kahng, Kang, Kumar, et al. 2012

IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.

Conventional CAD methodologies optimize a processor module for correct operation and prohibit timing violations during nominal operation. We propose recovery-driven design, a design approach that optimizes a processor module for a target timing error rate instead of correct operation. The target error rate is chosen based on how many errors can be gainfully tolerated by a hardware or software error resilience mechanism. We show that significant power benefits are possible from a recovery-driven design approach that deliberately allows errors caused by voltage overscaling to occur during nominal operation, while relying on an error resilience technique to tolerate these errors. We present a detailed evaluation and analysis of such a design-level methodology that minimizes the power of a processor module for a target error rate. We show how this design-level methodology can be extended to design recovery-driven processors: processors that are optimized to take advantage of hardware or software error resilience. These may be single-core processors or heterogeneously-reliable multi-core processors, in which individual cores are optimized for different reliability targets. We also discuss a gradual slack recovery-driven design approach that optimizes for a range of error rates to create soft processors, i.e., processors that have graceful failure characteristics and the ability to trade throughput or output quality for additional energy savings over a range of error rates. We demonstrate significant power benefits over conventional design: 11.8% on average over all modules and error rate targets, and up to 29.1% for individual modules. Processor-level benefits are 19.0% on average. Benefits increase when recovery-driven design is coupled with an error resilience mechanism or when the number of available voltage domains increases.
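
The abstract's premise, that deliberately allowing voltage-overscaling errors can pay off when an error resilience mechanism absorbs them, can be illustrated with a simple energy model. The sketch assumes energy per cycle proportional to V², a fixed recovery cost of `RECOVERY_CYCLES` extra cycles per error, and a made-up voltage/error-rate table (`CHARACTERIZATION`). It only picks a runtime operating point; it does not reproduce the paper's recovery-driven design flow, which reshapes the circuit itself for a target error rate. All names are hypothetical.

```python
"""Sketch: why tolerating some timing errors can lower energy.

Hypothetical model (illustrative numbers, not results from the paper):
- Dynamic energy per cycle scales with V**2 (normalized to 1.0 at V = 1.0).
- Each timing error is corrected by an error resilience mechanism that
  costs recovery_cycles extra cycles.
"""
from typing import List, Optional, Tuple

# (supply voltage, timing error rate per cycle): made-up characterization data
CHARACTERIZATION: List[Tuple[float, float]] = [
    (1.00, 0.0), (0.95, 0.0001), (0.90, 0.001),
    (0.85, 0.01), (0.80, 0.05), (0.75, 0.2),
]
RECOVERY_CYCLES = 10


def energy_per_useful_cycle(v: float, error_rate: float,
                            recovery_cycles: int = RECOVERY_CYCLES) -> float:
    """Relative energy per useful cycle, including recovery overhead."""
    return v ** 2 * (1.0 + error_rate * recovery_cycles)


def best_operating_point(table: List[Tuple[float, float]],
                         recovery_cycles: int = RECOVERY_CYCLES,
                         max_error_rate: Optional[float] = None
                         ) -> Tuple[float, float, float]:
    """Return (voltage, error_rate, energy) with minimum energy.

    If max_error_rate is given, only points at or below it are considered
    (raises ValueError if no point qualifies).
    """
    candidates = [(v, er) for v, er in table
                  if max_error_rate is None or er <= max_error_rate]
    v, er = min(candidates,
                key=lambda p: energy_per_useful_cycle(p[0], p[1], recovery_cycles))
    return v, er, energy_per_useful_cycle(v, er, recovery_cycles)


if __name__ == "__main__":
    print(best_operating_point(CHARACTERIZATION))
    print(best_operating_point(CHARACTERIZATION, max_error_rate=0.001))
    print(best_operating_point(CHARACTERIZATION, recovery_cycles=100))
```

With the default numbers, the minimum lands at 0.85 V (1% error rate), below the error-free 1.00 V point. Capping the error rate at 0.1% moves it to 0.90 V, and raising the recovery cost to 100 cycles also moves it back to 0.90 V, which reflects the abstract's point that the tolerable error rate depends on the resilience mechanism.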
