Reliability has emerged as an important design constraint for multi-core processors. A key research challenge is to detect and/or mitigate transient faults, such as soft errors, that can abruptly terminate an executing application or generate incorrect output; either outcome can be catastrophic in safety-critical systems. State-of-the-art reliability techniques deploy full-scale redundancy, such as double or triple modular redundancy (DMR, TMR), on different layers of the computing stack to detect and/or correct such transient faults. However, techniques relying on full-scale redundancy incur significant area, performance, and/or power overheads, which may not be feasible under system constraints such as deadlines and the available power budget for the full chip (or a processor core). Moreover, depending on its inherent resilience, not every application requires full-scale redundancy, which would otherwise waste resources and energy. Hence, techniques relying on selective redundancy have recently been investigated by researchers.

In this work, we propose a novel design methodology to generate and explore the architectural space of heterogeneous reliability modes for out-of-order superscalar multi-core processors. These heterogeneous modes are iso-ISA (i.e., they implement the same Instruction Set Architecture) but differ in their microarchitectural implementation, i.e., different components are hardened with different reliability techniques. Hence, these heterogeneous modes enable varying reliability and power/area trade-offs, from which an optimal configuration can be chosen at run time to meet the reliability requirements of a given system while reducing the corresponding power overheads (or, alternatively, to solve the inverse problem, i.e., maximizing the reliability under a given power constraint).
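The run-time choice among heterogeneous reliability modes can be illustrated with a minimal sketch. It is not the methodology of this work: the `Mode` type, the mode names, and all numbers below are invented placeholders, and only two metrics (vulnerability reduction and power overhead) are considered. It filters the modes down to a Pareto front, then solves either the minimum-power problem under a reliability requirement or the inverse problem under a power budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    """Hypothetical reliability mode; all values are illustrative fractions."""
    name: str
    vulnerability_reduction: float  # higher is better
    power_overhead: float           # lower is better


def pareto_front(modes):
    """Keep modes that no other mode beats on both metrics at once."""
    front = []
    for m in modes:
        dominated = any(
            o.vulnerability_reduction >= m.vulnerability_reduction
            and o.power_overhead <= m.power_overhead
            and (o.vulnerability_reduction > m.vulnerability_reduction
                 or o.power_overhead < m.power_overhead)
            for o in modes
        )
        if not dominated:
            front.append(m)
    return front


def pick_min_power(modes, min_reduction):
    """Cheapest Pareto-optimal mode meeting a reliability requirement."""
    feasible = [m for m in pareto_front(modes)
                if m.vulnerability_reduction >= min_reduction]
    return min(feasible, key=lambda m: m.power_overhead) if feasible else None


def pick_max_reliability(modes, max_power_overhead):
    """Inverse problem: most reliable mode within a power budget."""
    feasible = [m for m in pareto_front(modes)
                if m.power_overhead <= max_power_overhead]
    return (max(feasible, key=lambda m: m.vulnerability_reduction)
            if feasible else None)


# Invented example modes (not measurements from this work).
MODES = [
    Mode("baseline", 0.00, 0.00),
    Mode("hardened_a", 0.40, 0.15),
    Mode("hardened_b", 0.35, 0.30),  # dominated by hardened_a
    Mode("hardened_c", 0.85, 0.45),
    Mode("full_tmr", 0.99, 2.00),
]
```

With these placeholder values, `pick_min_power(MODES, 0.8)` and `pick_max_reliability(MODES, 0.5)` both select `hardened_c`, while a requirement that no mode can meet returns `None`.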
We implemented different reliability modes for the Alpha 21264 out-of-order superscalar microprocessor, and integrated different cores with heterogeneous reliability modes in a multi-core configuration. Our experimental results show that a Pareto-optimal heterogeneous reliability mode reduces the core vulnerability by 87%, on average, across multiple application workloads, with area and power overheads of 10% and 43%, respectively.

To further enhance the design space of heterogeneous reliability modes, we investigate the effectiveness of combining different processor state compression techniques, such as Distributed Multi-threaded Checkpointing (DMTCP), Hash-based Incremental Checkpointing (HBICT), and GNU zip, so that the correct processor state can be recovered once a fault is detected. These state compression techniques aim to reduce the storage required for the processor's correct state, which is backed up at an application checkpoint during execution to ensure successful recovery. We reduced the checkpoint sizes by a factor of 6× using a unique combination of different state compression techniques. To validate ...
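The general idea of combining incremental checkpointing with general-purpose compression can be sketched as follows. This is a generic illustration using only the Python standard library, not the DMTCP or HBICT implementations and not the combination evaluated in this work; the page size, function names, and the use of `pickle` as a container format are assumptions made for the example. The state is split into fixed-size pages, only pages whose hash changed since the previous checkpoint are kept, and the resulting delta is gzip-compressed.

```python
import gzip
import hashlib
import pickle

PAGE_SIZE = 4096  # assumed page granularity for this sketch


def split_pages(state: bytes):
    """Split a state image into fixed-size pages (last page may be short)."""
    return [state[i:i + PAGE_SIZE] for i in range(0, len(state), PAGE_SIZE)]


def make_checkpoint(state: bytes, prev_hashes):
    """Return (compressed delta, page hashes) relative to prev_hashes."""
    pages = split_pages(state)
    hashes = [hashlib.sha256(p).digest() for p in pages]
    # Keep only pages that are new or whose hash differs from last time.
    changed = {
        i: p for i, (p, h) in enumerate(zip(pages, hashes))
        if i >= len(prev_hashes) or prev_hashes[i] != h
    }
    # pickle is used only as a simple container in this illustration.
    blob = gzip.compress(pickle.dumps((len(state), changed)))
    return blob, hashes


def restore(prev_state: bytes, blob: bytes) -> bytes:
    """Rebuild the checkpointed state from the previous state and a delta."""
    length, changed = pickle.loads(gzip.decompress(blob))
    n_pages = (length + PAGE_SIZE - 1) // PAGE_SIZE
    pages = split_pages(prev_state)[:n_pages]
    pages += [b""] * (n_pages - len(pages))
    for i, p in changed.items():
        pages[i] = p
    return b"".join(pages)[:length]


# Usage: a full first checkpoint, then a delta after one byte changes.
state0 = bytes(range(256)) * 1024          # 64 pages of synthetic state
blob0, hashes0 = make_checkpoint(state0, [])
mutated = bytearray(state0)
mutated[5000] = 0xFF                        # touches page 1 only
state1 = bytes(mutated)
blob1, hashes1 = make_checkpoint(state1, hashes0)
```

In this usage example the second checkpoint stores a single page instead of all 64, and `restore(state0, blob1)` reproduces `state1`. How much any such scheme saves in practice depends on the workload and on how much state changes between checkpoints.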