2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT) 2014
DOI: 10.1109/dft.2014.6962085

GPGPUs ECC efficiency and efficacy

Abstract: In this paper we assess and discuss the efficiency and overhead of the Error-Correcting Code (ECC) mechanism available on modern GPGPUs, which are increasingly used for both High Performance Computing and safety-critical applications. Both the resilience to radiation-induced silent data corruption and functional interruption are experimentally and analytically addressed. The provided experimental analysis demonstrates that the ECC significantly reduces the occurrence of silent data corruption but may not be su…

Cited by 19 publications (13 citation statements)
References 15 publications
“…
- Detection with redundancy without diversity [163,6,191,178,199,57,73,138,99,164,34,170,120,140,77,127,139] and with diversity [13,14,8,12]
- Detection and/or correction with coding (e.g., ECC) and checkers [57,120,169,139,108,124,125,140,132]
- Recovery with re-execution or checkpoints [71,175,180,135,115,124]
- Mitigation with shielding and reconfiguration [153,127,91,193]
• Application-dependent:…”
Section: Random Hw Failures (mentioning)
confidence: 99%
“…
- Detection and fault-tolerance algorithmic approaches [35,138,169,170,168,195,140,49,50,110,139,156,102,202,80,204,58,76]
- Fault-tolerance based on intrinsic application and/or input data characteristics [63,78,159,158]
§4…”
Section: Random Hw Failures (mentioning)
confidence: 99%
“…Most of the experiments have been performed on high-performance NVIDIA architectures, including Fermi, Kepler or Pascal using neutron beams [11]-[13]. Several papers have evaluated the efficiency, efficacy and overhead of the hardware hardening ECC mechanism provided by most modern GPUs [14], [15]. Most of them conclude that this protection mechanism significantly reduces the SDCs, but increases the Single Event Functional Interrupts (SEFI) of the algorithms, for example when multiple-bit faults arise.…”
Section: Related Work (mentioning)
confidence: 99%
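
The trade-off described in the statement above (fewer SDCs, but more interruptions when multiple bits flip) can be illustrated with a toy single-error-correcting, double-error-detecting (SECDED) code. The sketch below is an assumption for illustration only: it uses an 8-bit extended Hamming(7,4) word, not the ECC layout of any actual GPU, and the names `encode`, `decode`, and `flip` are hypothetical helpers rather than anything from the cited papers.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) word.

    Index 0 holds the overall parity bit; indices 1..7 hold the
    Hamming(7,4) word with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # make the parity of all 8 bits even
    return code


def decode(code):
    """Return (data, status); data is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set-bit positions is 0 for a valid word
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: the decoder assumes exactly one and fixes it.
        code = list(code)
        code[syndrome] ^= 1  # syndrome 0 means the overall parity bit itself
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two flips, detected only.
        return None, "detected_uncorrectable"
    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return data, status


def flip(code, *positions):
    """Return a copy of the word with the given bit positions inverted."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out


word = encode(0b1011)
print(decode(flip(word, 5)))        # one flip: data recovered
print(decode(flip(word, 5, 6)))     # two flips: detected, not corrected
print(decode(flip(word, 1, 2, 4)))  # three flips: may be miscorrected silently
```

In this toy model a single flip is repaired transparently, a double flip cannot be repaired and would have to be surfaced as an error (the kind of event that interrupts execution), and a triple flip can be "corrected" to the wrong value, which is a silent corruption the code does not catch.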
“…This offloading of computation poses new issues related to both the access to the execution state and the particular reliability characteristics of acceleration devices. For example, GPUs tend to have higher DUEs per GB than CPUs [74, 156, 169] and GPUs may come with large memory ports (e.g., 128 bit for High-Bandwidth Memory 2 technologies) as well as reduced correction capabilities [132]. As an example, the work in Reference [60] shows that the DUE rate per GB for GDDR5 memory in NVIDIA Kepler GPUs can be as high as five times the DUE rate of CPU memory equipped with state-of-the-art error checking and correction support.…”
Section: 21 (mentioning)
confidence: 99%