In this paper, we investigate the reliability of Google's Coral Tensor Processing Units (TPUs) under both high-energy atmospheric neutrons (at ChipIR) and thermal neutrons from a pulsed source (at EMMA) and from a reactor (at TENIS). We report data obtained with an overall fluence of 3.41 × 10^12 n/cm^2 for atmospheric neutrons (equivalent to more than 30 million years of natural irradiation) and of 7.55 × 10^12 n/cm^2 for thermal neutrons. We evaluate the behavior of TPUs executing elementary operations with increasing input sizes (standard or depthwise convolutions) as well as eight CNN configurations (SSD MobileNet v2 and SSD MobileDet, trained with the COCO dataset, and Inception v4 and ResNet-50, trained with the ILSVRC2012 dataset). We found that, despite the high error rate, most neutron-induced errors only slightly modify the convolution output and do not change the CNNs' detection or classification. By reporting details about the error model, we provide valuable information on how to design CNNs so that neutron-induced events do not lead to misdetections or misclassifications.
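The conversion from an accelerated-beam fluence to years of natural exposure can be sketched as a simple division. The reference flux of 13 n/(cm^2·h) (neutrons above 10 MeV at sea level, the commonly cited JEDEC JESD89A value) is an assumption here, not a number taken from the abstract, and the function name is hypothetical.

```python
# Assumed reference flux: about 13 n/(cm^2*h) for >10 MeV neutrons at sea level.
# The abstract does not state which reference flux the authors used.
SEA_LEVEL_FLUX = 13.0          # n / (cm^2 * h), assumption
HOURS_PER_YEAR = 24 * 365


def equivalent_years(fluence, natural_flux=SEA_LEVEL_FLUX):
    """Years of natural exposure needed to accumulate `fluence` (n/cm^2)."""
    hours = fluence / natural_flux
    return hours / HOURS_PER_YEAR


if __name__ == "__main__":
    atmospheric_fluence = 3.41e12  # n/cm^2, from the abstract
    years = equivalent_years(atmospheric_fluence)
    print(f"about {years / 1e6:.1f} million years of natural irradiation")
```

Under this assumed flux the result is roughly 3 × 10^7 years, the same order as the figure quoted in the abstract; the exact value depends on the reference flux, altitude, and location assumed.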
Extensive research efforts are being carried out to evaluate and improve the reliability of computing devices, either through beam experiments or simulation-based fault injection. Unfortunately, it is still largely unclear to what extent fault injection can provide an accurate error rate estimation at early stages and whether beam experiments can be used to identify the weakest resources in a device. The importance and challenges of a timely yet realistic reliability evaluation grow with the increase of complexity in both the hardware domain, with the integration of different types of cores in an SoC (System-on-Chip), and the software domain, with the OS (operating system) required to take full advantage of the available resources. In this paper, we combine and analyze data gathered with extensive beam experiments (on the final physical CPU hardware) and microarchitectural fault injections (on early microarchitectural CPU models). We target a standalone Arm Cortex-A5 CPU and an Arm Cortex-A9 CPU integrated into an SoC and evaluate their reliability in bare-metal and Linux-based configurations. Combining experimental data that covers more than 18 million years of device time with the results of more than 176,000 injections, we find that both the SoC integration and the presence of the OS increase the system DUE (Detected Unrecoverable Error) rate (for different reasons) but do not significantly impact the SDC (Silent Data Corruption) rate, which is solely attributed to the CPU core. Our reliability analysis demonstrates that, even considering SoC integration and OS inclusion, early, pre-silicon microarchitecture-level fault injection delivers accurate SDC rate estimates and lower bounds for the DUE rates.
Duplication with Comparison (DWC) is an effective software-level solution to improve the reliability of computing devices. However, it introduces performance and energy consumption overheads that could be unsuitable for high-performance computing or real-time safety-critical applications. In this work, we present Reduced-Precision Duplication with Comparison (RP-DWC) as a means to lower the overhead of DWC by executing the redundant copy in reduced precision. RP-DWC is particularly suitable for modern mixed-precision architectures, such as NVIDIA GPUs, that feature dedicated functional units for computing with programmable accuracy. We discuss the benefits and challenges associated with RP-DWC and show that the intrinsic difference between the mixed-precision copies allows for detecting most, but not all, errors. However, as the undetected faults are the ones that fall within the difference between precisions, they are the ones that have a much smaller impact on the application output and, thus, might be tolerated. We investigate the impact of RP-DWC on fault detection, performance, and energy consumption on Volta GPUs. Through fault injection and beam experiments, using three microbenchmarks and four real applications, we show that RP-DWC achieves excellent coverage (up to 86%) with minimal overheads (as low as 0.1% time and 24% energy consumption overhead).
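The idea of comparing a full-precision result against a reduced-precision copy can be illustrated with a minimal CPU-side sketch. This is not the authors' GPU implementation: the toy kernel, the tolerance `REL_TOL`, the binary32 emulation through `struct`, and all function names are assumptions made only for illustration.

```python
import struct

REL_TOL = 1e-5  # assumed threshold, chosen above binary32 rounding error for this kernel


def to_f32(v):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", v))[0]


def flip_bit(v, bit):
    """Return binary64 `v` with one bit of its IEEE-754 encoding flipped."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", v))
    return struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))[0]


def kernel_f64(x):
    """Toy kernel 2*x + 1 in full precision."""
    return 2.0 * x + 1.0


def kernel_f32(x):
    """Same kernel with every intermediate rounded to binary32 (emulated)."""
    return to_f32(to_f32(2.0 * to_f32(x)) + 1.0)


def rp_dwc(xs, fault_bit=None):
    """Run both copies and return (full_precision_output, error_detected).

    If `fault_bit` is given, that bit of the first full-precision result is
    flipped to mimic a transient fault before the comparison.
    """
    full = [kernel_f64(x) for x in xs]
    reduced = [kernel_f32(x) for x in xs]
    if fault_bit is not None:
        full[0] = flip_bit(full[0], fault_bit)
    # `not (diff <= tol)` also flags NaN, which would slip through `diff > tol`.
    detected = any(not (abs(f - r) <= REL_TOL * abs(f))
                   for f, r in zip(full, reduced))
    return full, detected


if __name__ == "__main__":
    xs = [i / 10 for i in range(10)]
    print("no fault:", rp_dwc(xs)[1])
    print("flip mantissa bit 51:", rp_dwc(xs, fault_bit=51)[1])
    print("flip mantissa bit 0:", rp_dwc(xs, fault_bit=0)[1])
```

In this sketch, a flip in a high mantissa bit is flagged, while a flip in the least significant bit stays below the tolerance and goes undetected. That mirrors the abstract's observation that the undetected faults are those falling within the difference between the two precisions.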
limit. I am immensely grateful for everything, and I am profoundly proud of being his student. Besides that, although separated by an ocean, the weekly meetings and calls during these incredible three years have made us close, and I am glad to call him a very important friend of mine.
To the most important people in my life, my mom and dad, there are not enough words to show how grateful I am for all the motivation and support they have given me throughout my life. For all the happy moments together, the access to countless opportunities that changed my life, and unconditional love, I will be forever grateful to them. They always made me believe in myself, taught me the most important values in life and, for all that, I became who I am today. And I am immeasurably proud of who I am and the parents I have!
My sincere thanks to my girlfriend Letícia, who was very important for this work, not only because she helped me review it with very wise opinions, but also because she showed me the most beautiful feelings and, with that, genuinely inspired me. She has taught me a lot about life, especially by making me happier than ever.
I would like to express my deep gratitude to my most important friends, Filipe, Guilherme, Rafael and Thomaz (alphabetically ordered!). We have been through the craziest and most amazing experiences together, as well as the funniest moments. They inspire me to find balance in myself and encourage me to seek and be the best version of myself. I thank Guilherme, in particular, who supported me and was a fundamental person, especially during my first year at the university.
Last but not least, I would like to acknowledge the precious effort of the ChipIR and ILL teams. Thanks to them, either remotely or in person, we were able to carry out the radiation experiments that are the foundation of this work.
ACKNOWLEDGMENTS
I could not begin expressing my gratitude with anyone other than my professor, Paolo Rech. He has been, by far, the best professor I have ever had and, of course, was fundamental to the completion of this work. Although I did not have the opportunity to be his student in university courses, he has been my undergraduate research advisor for more than three years! He has not only taught me a great deal technically and academically, but has always motivated me to learn more and go beyond what I thought was my limit. I am immensely grateful for everything, and I am very proud of being his student. Besides that, although separated by an ocean, the weekly meetings and calls during these three incredible years have brought us closer, and I am glad to call him a very important friend.
To the most important people in my life, my mother and father, there are not enough words to show how grateful I am for all the motivation and support they have given me throughout my life. For all the happy moments together, the access to countless opportunities that changed my life, and the unconditional love, I will be eternally grateful to them. They always made me believe in myself, taught me the most important values in life and, for ...