SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation

Hari, Siva Kumar Sastry; Tsai, Timothy; Stephenson, Mark W.; Keckler, Stephen W.; Emer, Joel

doi:10.1109/ispass.2017.7975296

Cited by 131 publications

(62 citation statements)

References 18 publications


“…The fault-injection process has been conducted using the NVIDIA's fault injector, SASSIFI [34], which allows us to understand and analyze the error occurrence and its propagation in ResNet models. SASSIFI injects errors at the GPU's Instruction Set Architecture (ISA) visible state, such as general-purpose registers (GPRs), predicate registers (PR), condition-code registers (CC), and memory values [34]. SASSIFI has three error-injection modes to use: Register File (RF), Instruction Output Address (IOA), and Instruction Output Value (IOV) [34].…”

Section: B Fault-injection Setup (mentioning)

confidence: 99%

Soft Error Resilience of Deep Residual Networks for Object Recognition

et al. 2020


Convolutional Neural Networks (CNNs) have truly gained attention in object recognition and object classification in particular. When being implemented on Graphics Processing Units (GPUs), deeper networks are more accurate than shallow ones. Residual Networks (ResNets) are one of the deepest CNN architectures used in various fields including safety-critical ones. GPUs have proven to be the major accelerator for CNN models. However, modern GPUs are prone to radiation-induced soft errors, which is a serious issue in safety-compliant systems. In this work, we analyze and propose an approach to address the reliability of ResNet on GPUs. We first analyze three popular ResNet models, namely ResNet-50, ResNet-101, and ResNet-152, through NVIDIA's fault injector, SASSIFI. We perform an in-depth analysis of the model from the perspective of layer and kernel vulnerability. Then, we experimentally show the vulnerability of ResNet models and identify the most vulnerable portions. Finally, we validate our solution, which is a selective-hardening technique, through hardening the worth-hardening kernels to avoid unnecessary overheads. Our strategy is demonstrated to mask up to 93.38% of the injected errors with performance overhead less than 5.35%. Furthermore, the percentage of the errors causing misclassifications can be reduced from 4.2% to 0.104%, thereby significantly improving the model's reliability.
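The citation statement for this entry describes injecting errors into ISA-visible state such as general-purpose registers. As a rough illustration of what a single-bit error in a 32-bit register value looks like, here is a minimal Python sketch. It is not SASSIFI's implementation or interface; the function names `flip_bit` and `flip_float32_bit` are hypothetical, and the sketch assumes a 32-bit register holding either an integer or an IEEE 754 single-precision value.

```python
import struct


def flip_bit(word: int, bit: int, width: int = 32) -> int:
    """Return `word` with exactly one bit inverted (hypothetical helper).

    Models a transient single-bit error in a `width`-bit register value.
    """
    if not 0 <= bit < width:
        raise ValueError(f"bit index {bit} outside 0..{width - 1}")
    mask = (1 << width) - 1
    return (word ^ (1 << bit)) & mask


def flip_float32_bit(value: float, bit: int) -> float:
    """Flip one bit in the IEEE 754 single-precision encoding of `value`."""
    # Reinterpret the float's 4 bytes as an unsigned 32-bit integer.
    (word,) = struct.unpack("<I", struct.pack("<f", value))
    corrupted = flip_bit(word, bit)
    # Reinterpret the corrupted integer as a float again.
    (result,) = struct.unpack("<f", struct.pack("<I", corrupted))
    return result


if __name__ == "__main__":
    print(flip_float32_bit(1.0, 31))  # sign bit
    print(flip_float32_bit(1.0, 30))  # most significant exponent bit
    print(flip_float32_bit(1.0, 0))   # least significant mantissa bit
```

The position of the flipped bit matters a great deal: for the value 1.0, flipping bit 31 only changes the sign, flipping bit 30 turns the value into infinity, and flipping bit 0 changes it by a tiny fraction. This is one reason why resilience studies report what share of injected errors is masked rather than treating every injection as equally harmful.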


“…They aimed at injecting the faults that represent real hardware errors where they adopted the single-bit-flip fault model to simulate transient faults in GPU processors. Hari et al. [49] presented a fault-injection-based framework called SASSIFI for GPU application resilience evaluation, especially on soft errors. SASSIFI serves two kinds of tasks: (1) inject bit-flip errors into the register file for AVF analysis; (2) inject errors in the outputs of the instructions for error propagation evaluation.…”

Section: Related Work (mentioning)

confidence: 99%

Massively Parallel, Highly Efficient, but What About the Test Suite Quality? Applying Mutation Testing to GPU Programs

Zhu and Zaidman 2020

2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST)


Thanks to rapid advances in programmability and performance, GPUs have been widely applied in High-Performance Computing (HPC) and safety-critical domains. As such, quality assurance of GPU applications has gained increasing attention. This brings us to mutation testing, a fault-based testing technique that assesses the test suite quality by systematically introducing small artificial faults. It has been shown to perform well in exposing faults. In this paper, we investigate whether GPU programming can benefit from mutation testing. In addition to conventional mutation operators, we propose nine GPU-specific mutation operators based on the core syntax differences between CPU and GPU programming. We conduct a preliminary study on six CUDA systems. The results show that mutation testing can effectively evaluate the test quality of GPU programs: conventional mutation operators can guide the engineers to write simple direct tests, while GPU-specific mutation operators can lead to more intricate test cases which are better at revealing GPU-specific weaknesses.


“…Fault injection techniques are typically used to study, analyze and evaluate the behaviour of a system susceptible to faults [33]-[35]. The fault model for the ALPHA core components is based on single- and multi-bit transient faults.…”

Section: Fault Injection (mentioning)

confidence: 99%

Architectural-Space Exploration of Heterogeneous Reliability and Checkpointing Modes for Out-of-Order Superscalar Processors

et al. 2019


Reliability for multi-core processors has emerged as an important design constraint. A key research challenge is to detect and/or mitigate transient faults, such as soft errors, that can abruptly terminate an executing application or generate incorrect output, both leading to undesirable effects that can potentially be catastrophic in safety-critical systems. State-of-the-art reliability techniques and mechanisms deploy full-scale redundancy, like double or triple modular redundancy (DMR, TMR), on different layers of the computing stack to detect and/or correct such transient faults. However, the techniques relying on full-scale redundancy incur significant area, performance, and/or power overheads, which might not always be feasible/practical due to system constraints such as deadlines and available power budget for the full chip (or a processor core). Moreover, depending on the inherent resilience of an application, not every application requires full-scale redundancy, that would, otherwise, result in resource/energy wastage. Hence, techniques relying on selective redundancy have recently been investigated by researchers. In this work, we propose a novel design methodology to generate and explore the architectural-space of heterogeneous reliability modes for out-of-order superscalar multi-core processors. These heterogeneous modes are iso-ISA (i.e., implement the same Instruction Set Architecture), but differ in terms of the microarchitectural implementation, i.e., different components are hardened with different reliability techniques. Hence, these heterogeneous modes enable varying reliability and power/area trade-offs, from which an optimal configuration can be chosen at run time to meet the reliability requirements of a given system, while reducing the corresponding power overheads (or alternatively solving the inverse problem, i.e., maximizing the reliability under a given power constraint).
We implemented different reliability modes for the ALPHA 21264 out-of-order superscalar microprocessor, and integrated different cores with heterogeneous reliability modes in a multi-core configuration. Our experimental results show that a pareto-optimal heterogeneous reliability mode reduces the core vulnerability by 87%, on average, across multiple application workloads, with area and power overheads of 10% and 43%, respectively. To further enhance the design space of heterogeneous reliability modes, we investigate the effectiveness of combining different processor state compression techniques like Distributed Multi-threaded Checkpointing (DMTCP), Hash-based Incremental Checkpointing (HBICT) and GNU zip, such that the correct processor state can be recovered once a fault is detected. These state compression techniques aim at reducing the storage requirements of the processors' correct state, which is backed-up at an application checkpoint during its execution to ensure successful recovery. We reduced the checkpoint sizes by a factor of ~6× using a unique combination of different state compression techniques. To validate ...
