2020 IEEE 26th International Symposium on On-Line Testing and Robust System Design (IOLTS)
DOI: 10.1109/iolts50870.2020.9159704
High-level Modeling of Manufacturing Faults in Deep Neural Network Accelerators

Abstract: The advent of data-driven real-time applications requires the implementation of Deep Neural Networks (DNNs) on Machine Learning accelerators. Google's Tensor Processing Unit (TPU) is one such neural network accelerator that uses systolic array-based matrix multiplication hardware for computation at its core. Manufacturing faults at any state element of the matrix multiplication unit can cause unexpected errors in these inference networks. In this paper, we propose a formal model of permanent faults and their p…

Cited by 11 publications (9 citation statements)
References 10 publications
“…Fault resilience is an important characteristic of DNNs that can be defined as a function of accuracy loss. DNN accelerators inherit such resilience for a considerable range of fault bits, ensuring reliable computation [10]. However, this resilience may become insignificant when faults degrade performance by affecting the MSBs, especially when the errors remain unmasked in the resulting outputs.…”
Section: A. Fault Resilience
confidence: 99%
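The MSB sensitivity described in the quote above can be illustrated with a minimal sketch of a single stuck-at-1 fault on one bit of a weight register (the 8-bit two's-complement width, function name, and bit positions here are illustrative assumptions, not details from the cited work):

```python
def inject_stuck_at_one(value: int, bit: int, width: int = 8) -> int:
    """Force one bit of a two's-complement integer to 1 (stuck-at-1 fault)."""
    mask = 1 << bit
    raw = (value & ((1 << width) - 1)) | mask
    # Reinterpret the raw bit pattern as a signed two's-complement value.
    if raw >= 1 << (width - 1):
        raw -= 1 << width
    return raw

w = 3  # original 8-bit weight: 0b00000011
print(inject_stuck_at_one(w, bit=0))  # LSB fault: 3 -> 3 (masked, bit already 1)
print(inject_stuck_at_one(w, bit=6))  # high-bit fault: 3 -> 67 (large error)
print(inject_stuck_at_one(w, bit=7))  # sign-bit fault: 3 -> -125 (sign flips)
```

A fault on a low-order bit is often masked or contributes a small error, while the same fault on an MSB changes the magnitude (or sign) of the weight drastically, which matches the accuracy-loss behaviour the citing paper describes.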
“…However, the impact of a fault may vary according to the activation function, fault type, bit position, and approximation error resilience of each layer. Since permanent faults affect the performance of accurate DNNs more significantly than occasional transient faults [10], their impact may be even more prominent in AxDNNs due to their inexact nature.…”
Section: Introduction
confidence: 99%
“…The safety-critical bits in a machine learning system were identified by inducing soft errors in the network datapath and evaluating them on eight DNN models across six datasets [21]. A formal analysis of the impact of faults in the datapath of an accelerator has been illustrated using the Discrete-Time Markov Chain (DTMC) formalism [22]. The severe performance penalty in a systolic array-based DNN accelerator has been demonstrated by inducing manufacturing defects in the datapath of the accelerator [4], [23], [24].…”
Section: B. DNN Accelerator and Its Reliability
confidence: 99%
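The kind of fault injection surveyed above — a permanent defect in one processing element (PE) of a systolic array corrupting the output it accumulates — can be sketched with a simplified output-stationary model (the function name, PE coordinates, and stuck-at-1 accumulator bit are illustrative assumptions, not the cited papers' exact setups):

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray,
                    fault_pe=None, fault_bit: int = 6) -> np.ndarray:
    """Integer matmul accumulated PE by PE, as in an output-stationary
    systolic array. If fault_pe=(i, j) is given, the accumulator of that
    PE has a permanent stuck-at-1 fault on `fault_bit`, applied after
    every accumulation step."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n), dtype=np.int64)
    for i in range(m):
        for j in range(n):
            acc = 0
            for t in range(k):
                acc += int(A[i, t]) * int(B[t, j])
                if fault_pe == (i, j):
                    acc |= 1 << fault_bit  # stuck-at-1 on one accumulator bit
            C[i, j] = acc
    return C

A = np.arange(4).reshape(2, 2)
B = np.eye(2, dtype=int)
print(systolic_matmul(A, B))                   # fault-free: equals A @ B
print(systolic_matmul(A, B, fault_pe=(0, 0)))  # only C[0, 0] is corrupted
```

Because the fault is permanent, the corrupted PE poisons its output element on every inference, which is why such defects cause the intense, persistent accuracy penalty the citing paper contrasts with occasional transient faults.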
“…However, the most recent works' main concentration is limited to AccDNNs. Recently, Kundu et al elucidated the impact of masking and non-masking of permanent faults in small-scaled Google-TPU-like systolic array-based DNN accelerators with their formal guarantees [20]. In another work, Santos et al injected the permanent faults in the register files of GPU to demonstrate their effect on reduced precision AccDNNs [21].…”
mentioning
confidence: 99%