A dynamic reconfiguration mechanism to increase the reliability of GPGPUs

Condia, Josie E. Rodriguez; Narducci, Pierpaolo; Reorda, M. Sonza; Sterpone, Luca

doi:10.1109/vts48691.2020.9107572

Cited by 4 publications

(2 citation statements)

References 28 publications


“…Their higher complexity and the huge amount of computing units make GPU hardening more challenging than for CPUs. Some GPU mitigation solutions based on Built-In Self-Repair (BISR), exploiting spare modules to replace faulty units, have also been proposed [29]-[31]. Furthermore, some authors proposed the reconfiguration of computational modules [32], [33] and memories [34] in GPUs once a fault is detected.…”

Section: B Mitigation Strategies (mentioning)

confidence: 99%

An Effective Method to Identify Microarchitectural Vulnerabilities in GPUs

Condia, Rech, Santos et al. 2022

IEEE Trans. Device Mater. Reliab.

Self Cite


Graphics Processing Units (GPUs) are increasingly adopted in several domains where reliability is fundamental, such as self-driving cars and autonomous systems. Unfortunately, GPU devices have been shown to have a high error rate, while the constraints imposed by real-time safety-critical applications make traditional (and costly) replication-based hardening solutions inadequate. This work proposes an effective methodology to identify the architecturally vulnerable sites in GPU modules, i.e., the locations that, if corrupted, most affect the correct execution of instructions. We first identify, through an innovative method based on Register-Transfer Level (RTL) fault injection experiments, the architectural vulnerabilities of a GPU model. Then, we mitigate the fault impact via selective hardening applied to the flip-flops that have been identified as critical. We evaluate three hardening strategies: Triple Modular Redundancy (TMR), Triple Modular Redundancy against SETs (∆TMR), and Dual Interlocked Storage Cells (DICE flip-flops). The results gathered on a publicly available GPU model (FlexGripPlus), considering functional units, pipeline registers, and the warp scheduler controller, show that our method can tolerate from 85% to 99% of faults in the pipeline registers, from 50% to 100% of faults in the functional units, and up to 10% of faults in the warp scheduler, with a reduced hardware overhead (in the range of 58% to 94% when compared with traditional TMR). Finally, we adapt the methodology to perform a complementary evaluation targeting permanent faults and identify critical sites prone to propagate fault effects across the GPU. We found that a considerable percentage (65% to 98%) of flip-flops that are critical for transient faults are also critical for permanent faults.
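As a rough illustration of the TMR strategy named in the abstract above, the sketch below models a triplicated register with a bitwise majority voter. It is a software toy, not the paper's RTL: the names `tmr_vote`, `flip_bit`, and the 8-bit `golden` value are assumptions introduced here for illustration only.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies of a register value."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(value: int, bit: int) -> int:
    """Model a single bit-flip (hypothetical transient fault) in one copy."""
    return value ^ (1 << bit)


# Hypothetical 8-bit register content held in three redundant flip-flop sets.
golden = 0b10110010

# A fault in one copy is masked by the voter.
single_fault = tmr_vote(golden, flip_bit(golden, 3), golden)

# The same bit corrupted in two copies defeats the voter.
double_fault = tmr_vote(flip_bit(golden, 3), flip_bit(golden, 3), golden)

print(single_fault == golden)  # fault masked
print(double_fault == golden)  # fault not masked
```

The voter masks any fault confined to one copy, which is why the cost of full TMR is high and why the abstract's selective approach restricts hardening to the flip-flops identified as critical.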


“…In [28], a hybrid approach called Dynamic Duplication with Comparison (DDWC) is presented aimed to detect faults in the execution cores during the in-field operation. Similarly, in [29], and [30], the authors propose mitigation solutions for similar structures by adapting the BISR mechanism to replace faulty modules during the manufacturing process and the in-field operation, respectively. Nevertheless, most currently adopted fault-tolerance solutions for GPGPUs do not provide the detection and the mitigation of faults using the same architecture.…”

Section: Introduction (mentioning)

confidence: 99%

DYRE: a DYnamic REconfigurable solution to increase GPGPU’s reliability

et al. 2021

Self Cite


General-purpose graphics processing units (GPGPUs) are extensively used in high-performance computing. However, it is well known that the reliability of these devices may be limited by faults arising at the hardware level. This work introduces a flexible solution to detect and mitigate permanent faults affecting the execution units in these parallel devices. The proposed solution is based on adding some spare modules to perform two in-field operations: detecting and mitigating faults. The solution takes advantage of the regularity of the execution units in the device to avoid significant design changes and reduce the overhead. The proposed solution was evaluated in terms of reliability improvement and area, performance, and power overhead costs. For this purpose, we resorted to a micro-architectural open-source GPGPU model (FlexGripPlus). Experimental results show that the proposed solution can extend the reliability by up to 57%, with overhead costs lower than 2% and 8% in area and power, respectively.


A dynamic hardware redundancy mechanism for the in-field fault detection in cores of GPGPUs

Condia, Narducci, Reorda et al. 2020

2020 23rd International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS)

Self Cite


In the past, reliability features were not so relevant in most General-Purpose Graphics Processing Unit (GPGPU) application fields (e.g., multimedia and gaming). Nowadays, GPGPUs are used in new domains, such as automotive, where reliability plays a significant role. In this work, we describe a dynamic duplication with comparison (DDWC) mechanism intended to harden the Scalar Processor (SP) units located in the Streaming Multiprocessors (SMs) of a GPGPU. The proposed mechanism targets the permanent faults that may arise inside the SPs. One additional SP unit is included in the system to redundantly compute the same operations as a selected SP. The results are compared, and possible failures are detected. A custom reconfiguration instruction allows the dynamic selection of the target SP to be monitored. Experimental results show that the proposed mechanism introduces a limited area overhead while providing a significant increase in the in-field fault detection capabilities of the GPGPU. Its flexibility allows selecting the best trade-off between fault detection latency and performance overhead.
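The duplication-with-comparison idea described in the abstract above can be sketched in software as follows. This is only a behavioural toy under stated assumptions: SPs are modelled as Python functions performing a 32-bit add, the permanent fault is a hypothetical stuck-at-1 output bit, and the round-robin choice of the monitored SP stands in for the paper's custom reconfiguration instruction, whose actual semantics are not given here.

```python
from typing import Callable, List, Optional, Tuple

Op = Callable[[int, int], int]


def healthy_sp(a: int, b: int) -> int:
    """Fault-free scalar processor model: 32-bit addition."""
    return (a + b) & 0xFFFFFFFF


def make_stuck_at_one_sp(bit: int) -> Op:
    """SP model with a permanent stuck-at-1 fault on one output bit (assumed fault model)."""
    def faulty_sp(a: int, b: int) -> int:
        return healthy_sp(a, b) | (1 << bit)
    return faulty_sp


def ddwc_scan(sps: List[Op],
              operands: List[Tuple[int, int]],
              spare: Op = healthy_sp) -> Optional[int]:
    """Monitor one SP per step (round-robin, an assumption) with a single spare.

    Returns the index of the first SP whose result differs from the spare's,
    or None if no mismatch is observed for the given operands.
    """
    for step, (a, b) in enumerate(operands):
        target = step % len(sps)           # SP currently selected for monitoring
        if sps[target](a, b) != spare(a, b):
            return target                  # mismatch: fault detected in this SP
    return None


sps = [healthy_sp, healthy_sp, make_stuck_at_one_sp(0), healthy_sp]

print(ddwc_scan(sps, [(2, 2)] * 8))  # sum has bit 0 clear, so the fault is exposed
print(ddwc_scan(sps, [(1, 2)] * 8))  # sum already has bit 0 set, so the fault stays hidden
```

The second call shows why detection latency depends both on how often each SP is selected and on whether the operands excite the fault, which is the trade-off the abstract mentions.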
