A low-level software-based fault tolerance approach to detect SEUs in GPUs' register files

Goncalves, Marcio M.; Saquetti, Mateus; Kastensmidt, Fernanda Lima; Azambuja, Jose Rodrigo

doi:10.1016/j.microrel.2017.07.035

Cited by 17 publications

(11 citation statements)

References 3 publications


“…Similarly, some authors targeted and evaluated the effect of faults in data-path units [33], including the register file [34], and pipeline registers [35]. Other work proposed graceful performance degradation strategies to face permanent faults in SM units by employing specially instrumented kernels and coding styles, thus distributing the tasks across the available SMs [32].…”

Section: Related Work In the Area

mentioning

confidence: 99%

An On-Line Testing Technique for the Scheduler Memory of a GPGPU

2020


The highly parallel processing capabilities and reduced power performance of General Purpose Graphics Processing Units (GPGPUs) have been crucial factors for their massive use in multiple fields, such as multimedia and high-performance computing applications. Nowadays, more demanding areas, such as automotive, employ GPGPU devices where safety and reliability are mandatory design constraints. Nevertheless, the structural complexity, the transistor density, and the implementation in the latest silicon technologies introduce challenges to match safety and reliability requirements. In these technologies, wear-out and aging are factors that may significantly increase the occurrence of permanent faults during the lifetime operation. Moreover, these faults may generate unacceptable misbehaviors during the execution of an application. These constraints require devising new methods for in-field fault detection, thus verifying the integrity and correct behavior of the device during its whole operational life. This work proposes a technique to generate functional self-test programs targeting the detection of permanent static faults in the memory of the warp scheduler of a GPGPU. The proposed technique can translate fault primitives, which represent the effect of faults in a memory cell, into self-test functions and programs composed of a sequence of operations to excite the fault in the memory and to propagate its effects to a visible location, thus detecting its presence. We focused on the memory in the warp scheduler because it represents a crucial module for the device operation. Furthermore, this memory is present in each Streaming Multiprocessor (SM) of a GPGPU. Some experimental results to validate the method have been gathered, resorting to the NVIDIA Visual Profiler and the Nsight Debugger using the NVIDIA-GEFORCE GTX GPU platform and a structural fault simulator. The CUDA programming environment was used to implement the test procedures. 
INDEX TERMS Functional test, general purpose graphics processing units (GPGPUs), memory test, software-based self-test (SBST).



“…The fault injection environment is based on the ModelSim framework, and the injection methodology we used is the same introduced in [5,20]. Further details regarding the descriptions and configurations of the used benchmarks can be found in [18].…”

Section: Fault Detection Capabilities

mentioning

confidence: 99%

“…On the one hand, software DWC mechanisms exploit time redundancy by repeatedly executing instructions [5][6][7], functions [8], or application tasks [8][9][10]. At the end, results are compared to detect faults.…”

mentioning

confidence: 99%

A dynamic hardware redundancy mechanism for the in-field fault detection in cores of GPGPUs

Condia, Narducci, Reorda et al. 2020

2020 23rd International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS)


In the past, in most General-Purpose Graphic Processing Units (GPGPUs) application fields (e.g., multimedia and gaming), the reliability features were not so relevant. Nowadays, GPGPUs are used in new domains, such as the automotive one, where reliability plays a significant role. In this work, we describe a dynamic duplication with a comparison (DDWC) mechanism intended to harden the Scalar Processor (SP) units located in the Streaming multiprocessors (SM) of a GPGPU. The proposed mechanism targets the permanent faults that may arise inside the SPs. One additional SP unit is included in the system to compute redundantly the same operations of a selected SP. Results are compared, and possible failures detected. A custom reconfiguration instruction allows the dynamic selection of the target SP to be monitored. Experimental results show that the proposed mechanism introduces a limited area overhead while it provides a significant increase in the in-field fault detection capabilities of the GPGPU. Its flexibility allows selecting the best trade-off between fault detection latency and performance overhead.
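The duplication-with-comparison idea that appears in the citation statement and abstract above (compute the same operation redundantly, then compare the results) can be illustrated with a small software sketch. This is only a minimal illustration in Python, not the mechanism of any cited paper: the names `run_with_dwc`, `flip_bit`, and `FaultDetected` are hypothetical, and the optional `corrupt_first` hook is an assumption added purely to simulate a single bit flip in one of the two results.

```python
from typing import Callable, Optional


class FaultDetected(Exception):
    """Raised when the two redundant executions disagree."""


def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted (a simulated single-bit upset)."""
    return value ^ (1 << bit)


def run_with_dwc(
    op: Callable[[int, int], int],
    a: int,
    b: int,
    corrupt_first: Optional[Callable[[int], int]] = None,
) -> int:
    """Execute op twice (time redundancy) and compare the two results."""
    first = op(a, b)
    if corrupt_first is not None:
        # Hypothetical fault-injection hook: corrupt only the first copy.
        first = corrupt_first(first)
    second = op(a, b)
    if first != second:
        raise FaultDetected(f"mismatch: {first} != {second}")
    return first
```

A fault-free call such as `run_with_dwc(lambda x, y: x + y, 2, 3)` returns the sum, while passing `corrupt_first=lambda v: flip_bit(v, 0)` makes the comparison fail and raises `FaultDetected`. The sketch detects a mismatch but cannot tell which copy is wrong, which is why duplication alone provides detection rather than correction.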


“…Software-based approaches provide high detection rates at the cost of performance degradation. They insert additional instructions that must be executed by the processing system, therefore increasing execution runtime, and can be applied to any GPU architecture with an available program source-code [6]. Hardware-based approaches, on the other hand, can be applied with no performance degradation, as replicated hardware can be deployed in parallel with the original, and, as long as the critical path is not altered, the operating frequency can be maintained, but require access to GPU architecture description [7].…”

Section: Introduction

mentioning

confidence: 99%

Improving GPU register file reliability with a comprehensive ISA extension

Goncalves, Condia, Reorda et al. 2020

Microelectronics Reliability


This work proposes a comprehensive ISA extension to improve GPU reliability to transient effects. Three additional instructions are proposed, implemented, and combined with software-based datapath duplication. Modified program codes are compared to state-of-the-art software-based fault tolerance techniques in terms of execution time. The circuit area is evaluated against the original GPU architecture, and a fault injection campaign is performed to assess reliability. Results show that this comprehensive ISA extension improves performance and fault detection capabilities of software-based approaches at negligible costs in terms of circuit area. This work can help engineers in designing more efficient and resilient GPU architectures.
