This paper presents an overview of Phaser, a toolset and methodology for modeling the effects of soft errors on the architectural and microarchitectural functionality of a system. The Phaser framework is used to understand the system-level effects of the soft-error rates of a microprocessor chip as its design evolves through the phases of preconcept, concept, high-level design, and register-transfer-level design implementation. Phaser represents a strategic research vision that is being proposed as a next-generation toolset for predicting chip-level failure rates and studying reliability-performance tradeoffs during the phased design process. This paper primarily presents Phaser/M1, the early-stage predictive modeling capability of the framework.
Introduction

As the trend toward smaller device and wire dimensions continues, we are entering an era of increased chip integration, reduced supply voltages, and higher frequencies. An inescapable consequence of this development is that transient, or soft, errors will continue to be a serious threat to the general technology of robust computing. Soft errors may be caused by various events, including neutrons from cosmic-particle incidence, alpha particles from trace radioactive content in packaging materials, and inductive noise effects (L di/dt) on the chip supply voltage resulting from aggressive forms of dynamic power management.

As technology scales from 65 nm toward 45 nm and beyond, current soft-error rate (SER) projections for SRAM cells and for latch and logic elements are noteworthy. As Borkar [1] has discussed, the per-bit SER of SRAM cells appears to be leveling off, but the bit count per chip is increasing exponentially in accordance with Moore's Law; latch SER is catching up with SRAM per-bit rates, with a steeper slope of increase; and combinational-logic SER is projected to increase at a much faster pace, although the absolute numbers are significantly smaller than the SRAM or latch numbers at present. For silicon-on-insulator technology, going from 65 nm to 45 nm, the per-bit latch SER is predicted to increase by two to five times [2], and the number of latches per chip is expected to increase with integration density. Storage-cell SER will still dominate, but latch errors will be of increasing relevance at 45-nm technologies and beyond.

Chip design must begin with a consideration of system-level mean-time-to-failure (MTTF) targets, and the design methodology must be able to estimate or set bounds for chip-level failure rates with reasonable accuracy in order to avoid in-field system quality problems. A balanced combination of circuit- and logic-level innovations and architecture- and software-level solutions is necessary to achieve the required resiliency to single-event upsets (SEUs). In particular, there is a need for a comprehensive understanding of the vulnerabilities associated with the various units on the chip with regard to workload behavior. When such information is available, appropriate approaches, such as selective duplication, ...
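To make the chip-level failure-rate arithmetic concrete, the sketch below combines per-element SER values (expressed in FIT, failures per 10^9 device-hours) with element counts and derating factors to produce a chip-level SER estimate and the corresponding MTTF. All numbers and the simple additive model are illustrative assumptions for this sketch, not data or methodology from the Phaser framework itself.

    # Illustrative chip-level SER estimate; all numbers below are assumed, not from the paper.
    # SER is expressed in FIT: failures per 10^9 device-hours.

    # Hypothetical raw per-element SER, element count, and derating factor (e.g., a
    # vulnerability factor reflecting how often an upset actually matters to the workload).
    components = {
        "sram_bit":   (1.0e-4, 256e6, 0.3),
        "latch":      (5.0e-5, 10e6,  0.2),
        "logic_gate": (1.0e-6, 100e6, 0.05),
    }

    def chip_ser_fit(components):
        """Sum derated per-element SER contributions into a chip-level FIT rate."""
        total = 0.0
        for name, (fit_per_element, count, derating) in components.items():
            contribution = fit_per_element * count * derating
            print(f"{name:10s}: {contribution:10.1f} FIT")
            total += contribution
        return total

    total_fit = chip_ser_fit(components)
    mttf_hours = 1e9 / total_fit  # 1 FIT = 1 failure per 10^9 hours
    print(f"chip total: {total_fit:10.1f} FIT -> MTTF ~ {mttf_hours / (24 * 365):.1f} years")

Under these assumed inputs the chip-level rate is dominated by the SRAM contribution, consistent with the observation above that storage-cell SER still dominates while latch and logic terms grow in relative importance as technology scales.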