As the integration level and clock speed of VLSI devices keep rising, the power consumption of those devices increases dramatically. At the same time, the shrinking size of transistors, which enables denser and smaller chips running at faster clock speeds, makes devices more susceptible to environment-induced faults. Both power reduction and concurrent error detection are becoming enabling technologies in the Very Deep Sub-Micron (VDSM) and nanometer technology domains. However, existing techniques either minimize the power of "fault-free" devices or improve fault tolerance without regard to power. Little work has been done to optimize the two objectives simultaneously. In this paper we attack this problem by unifying power efficiency and fault tolerance in a comprehensive Integer Linear Programming formulation. The proposed approach is tested using known benchmarks.
Motivation
Over the past decades, significant technological progress has been made in the Very Deep Sub-Micron (VDSM) and nanometer technology domains. However, the performance improvement from shrinking transistors, which enable denser and smaller chips that run at faster clock speeds and consume less power, has come at the cost of decreased reliability, as warned by the International Technology Roadmap for Semiconductors (ITRS). Besides inherent design defects such as ground bounce, IR drop, leakage, and charge sharing, densely packed chips are also highly susceptible to environment-induced Single Event Upsets (SEU) and Single Event Latchups (SEL). To compensate for the inevitable increase in failures, and to avoid revenue losses, yield reduction, and time-to-market slowdown, the ITRS urges "automatic insertion of robustness into the design." On the other hand, design for power efficiency is now under intensive research due to increased chip density and clock frequency. Power reduction techniques can help reduce power dissipation, extend battery lifetime, improve noise margin, and reduce packaging and cooling costs.
However, until recently little research has jointly considered fault tolerance and power efficiency. This is mainly due to the direct conflict between the two objectives: fault tolerance is usually achieved through redundancy (space, time, or information), and a redundant system unavoidably consumes more power than its non-redundant version. This conflict must be addressed properly, since the need for power efficiency and fault tolerance continues to rise while the demand for faster and better systems never stops. However, because the techniques for the two design objectives have been developed independently, they inevitably achieve one objective at the expense of the other.
For example, exploiting the Register-Transfer (RT) level operation-to-cycle schedule and operation-to-unit binding to reduce switching activity tends to disturb the schedule and binding tuned for fault detection, and vice versa. Further, the existing power-reduction techniques do not exploit the unique characteristics of faults and redundant computations, and will not result in good q...