The continued increase of fault rates in integrated circuits makes processors, especially many-core processors, more susceptible to errors. Meanwhile, most systems and applications do not need full fault coverage, which incurs excessive overhead, so on-demand fault tolerance is desirable for them. In this paper, we propose an adaptive low-overhead fault tolerance mechanism for many-core systems, called Device View Redundancy (DVR). It treats fault tolerance as a device that an application can configure and use when high reliability is needed. Moreover, DVR exploits idle resources for low-overhead fault tolerance, based on the observation that the utilization of many-core systems is low due to the lack of parallelism in applications. Finally, experiments show that DVR reduces performance overhead by 16% to 98% compared with full Dual Modular Redundancy (DMR).
Soft errors are an increasingly important threat to the reliability of integrated circuits. Chips manufactured in advanced technologies show variations in soft error rate (SER) caused by variations in process parameters. With the ongoing reduction of feature sizes and the complexity of the operating environment (temperature, voltage, radiation pressure and so on), SER variation is becoming increasingly pronounced. Checkpointing is one of the most popular recovery methods, used in many systems, and the checkpoint interval can significantly influence performance. However, the optimal checkpoint interval depends on the SER. In theory, SER-adaptive checkpointing (SACP), which dynamically matches the checkpoint interval to the real-time SER, can reduce checkpoint overhead under variable SER. The benefits of SACP, however, depend on the SER variation. We give a mathematical model of SER variation and propose a way to predict SER based on the most recently occurred errors. Results show high accuracy of SER prediction and a substantial reduction in overhead with SACP.
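To illustrate why the optimal checkpoint interval depends on the SER, the following minimal sketch pairs a sliding-window error-rate estimate with Young's first-order approximation, T = sqrt(2C/λ), where C is the cost of one checkpoint and λ is the error rate. This is not the model or predictor from the abstract above: Young's formula is a swapped-in, well-known approximation, and the names `young_interval`, `RecentErrorRateEstimator`, and the `window` parameter are hypothetical, chosen only for illustration.

```python
import math
from collections import deque


def young_interval(checkpoint_cost: float, error_rate: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    T = sqrt(2 * C / lambda), with C the checkpoint cost and lambda the
    error rate, both in consistent time units. Assumption: this stands in
    for the paper's own model, which the abstract does not give.
    """
    if error_rate <= 0:
        return math.inf  # no observed errors: never checkpoint
    return math.sqrt(2.0 * checkpoint_cost / error_rate)


class RecentErrorRateEstimator:
    """Estimates the error rate from the most recent error timestamps.

    Hypothetical helper: a plain sliding window, not the paper's predictor.
    """

    def __init__(self, window: int = 8) -> None:
        # Only the last `window` error timestamps are kept.
        self.timestamps = deque(maxlen=window)

    def record_error(self, t: float) -> None:
        self.timestamps.append(t)

    def rate(self) -> float:
        # At least two errors are needed to measure an inter-error span.
        if len(self.timestamps) < 2:
            return 0.0
        span = self.timestamps[-1] - self.timestamps[0]
        if span <= 0:
            return 0.0
        return (len(self.timestamps) - 1) / span


if __name__ == "__main__":
    est = RecentErrorRateEstimator(window=4)
    for t in (0.0, 100.0, 200.0):
        est.record_error(t)
    # Re-derive the interval whenever the estimated rate changes.
    print(young_interval(checkpoint_cost=2.0, error_rate=est.rate()))
```

In this sketch a higher estimated rate shortens the interval and a lower one lengthens it, which is the adaptive behavior the abstract attributes to SACP; how well it pays off depends on how much the rate actually varies.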
Abstract. Future many-core processors may contain more than 1000 cores on a single die. However, continued scaling of silicon fabrication technology exposes chips of such scale to a higher vulnerability to errors. A low-overhead and adaptive fault-tolerance mechanism is desired for general-purpose many-core processors. We propose high-level adaptive redundancy (HLAR), which possesses several unique properties. First, the technique employs selective redundancy based on application assistance and dynamic core scheduling. Second, the method requires minimal overhead when the mechanism is disabled. Third, it expands the local memory within the replication sphere, which heightens the replication level and simplifies the redundancy mechanism. Finally, it decreases bandwidth through various compression methods, thus effectively balancing reliability, performance, and power. Experimental results show a remarkably low overhead while covering 99.999% of errors with only 0.25% more network-on-chip traffic.