On the Functional Test of Special Function Units in GPUs

Guerrero-Balaguera, Juan-David; Condia, Josie E. Rodriguez; Reorda, M. Sonza

doi:10.1109/ddecs52668.2021.9417025

Cited by 10 publications

(2 citation statements)

References 14 publications


“…This article (https://doi.org/10.24050/reia.v20i39.1609) provides a procedure for describing floating-point units in accordance with the standard, covering the normalization, rounding, mantissa alignment, and special cases that arise when performing operations. Because the design is fully combinational, it is scalable and can therefore be adapted to double precision, although this requires more logic resources. It is also a starting point for evaluating tests on different digital circuit architectures that perform floating-point operations (Cantoro et al., 2016; Condia et al., 2020; Guerrero-Balaguera, Condia and Reorda, 2021). An advantage of the fully combinational design is that it can easily be parallelized using pipelining techniques, since the combinational design allows different pipeline levels to be evaluated; the proposed design can thus be used to evaluate faults in future pipelined and non-pipelined implementations.…”

Section: Introduction

unclassified

Unidad aritmética de punto flotante: diseño e implementación con portabilidad

Patarroyo-Gutierrez

Hernández

Meléndez

2022

reveia


The use of floating-point units (FPUs) in digital processing has increased because of the high precision and wide range of numbers they can represent. Image processing, digital infinite impulse response (IIR) and finite impulse response (FIR) filters, and digital controllers require such units to obtain more accurate results and avoid unstable responses. However, some processors implement them as built-in units, which makes prototype development technologically dependent on manufacturers. To avoid this dependence, this article presents the design of modules for the operations most used in digital signal processing: multiplication and addition/subtraction. It presents the steps and considerations to take into account, such as exceptions, rounding, and operand normalization, in order to implement these operations on any field-programmable gate array (FPGA). Results were verified using the MODELSIM® test bench, and the error rate was determined using MATLAB®.
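As a minimal sketch of the special-case handling the abstract mentions, the Python snippet below splits a single-precision value into its sign, exponent, and fraction fields and classifies it before any arithmetic would be applied. It assumes IEEE 754 binary32 encoding, which the abstract does not name explicitly; the function name `decode_float32` is hypothetical and is not taken from the cited design.

```python
import struct


def decode_float32(x: float) -> dict:
    """Split x, rounded to binary32, into its raw fields and classify it.

    Assumption: IEEE 754 single precision (1 sign, 8 exponent, 23 fraction bits).
    """
    # Reinterpret the 4-byte float encoding as an unsigned 32-bit integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF  # biased exponent field
    fraction = bits & 0x7FFFFF      # stored mantissa, without the hidden bit

    # Special cases are recognized from the exponent field alone,
    # with the fraction field breaking the tie.
    if exponent == 0xFF:
        kind = "nan" if fraction else "infinity"
    elif exponent == 0:
        kind = "subnormal" if fraction else "zero"
    else:
        kind = "normal"
    return {"sign": sign, "exponent": exponent, "fraction": fraction, "kind": kind}
```

A hardware adder or multiplier would typically branch on a classification like this first, and only then align or multiply the mantissas, normalize, and round.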


“…Pseudorandom and ATPG-based approaches are effective in regular structures of a GPU, such as the functional units and the register file, since these structures are addressed (and tested) in parallel. Moreover, the static organization and the understanding of distribution policies in the schedulers allow the development of embarrassingly parallel TPs (see Figure 2), exploiting the multithread parallelism to inject patterns and also reducing the in-field execution of TPs [8]. On the other hand, deterministic approaches exploit the functionality and structure in a module to deploy well-defined algorithms, such as March algorithms for internal memories (e.g., within the controllers) [4].…”

mentioning

confidence: 99%

Using STLs for Effective In-Field Test of GPUs

et al. 2023

Self Cite


Modern graphics processing units (GPUs) are manufactured using cutting-edge technologies but are prone to suffer from in-field errors and reliability issues [1]. The flexibility and computational power of GPUs push their adoption in developing advanced driver-assistance systems (ADASs) and sensor fusion solutions in the automotive and autonomous systems domains. However, the premature aging and wear-out features in new transistor technologies promote the rising of permanent faults during the in-field operation. In safety-critical applications, unaffordable failures caused by faults can induce the entire system to fail or even result in catastrophic consequences if no appropriate measures are taken promptly. Hence, the development of countermeasures for the in-field detection of faults is of great importance in GPUs. Published works addressing in-field fault detection for GPUs can be classified into three classes: 1) design for testability (DfT) methods, which are purely hardware-oriented; 2) hybrid approaches, which combine hardware structures with reconfigurable capabilities at the software level; and 3) software-based self-test (SBST) solutions. DfT schemes are widely used for the end-of-production test in current devices. However, they are not always available for in-field operation and may not satisfy time constraints in many applications. Furthermore, hybrid solutions, based on the addition or use of available structures (i.e., performance counters) to extend the fault observability of a module, must be included in the design phases by modifying the hardware-software interface to provide instruction-based control of the included structures. Jagannadha et al. [2] proposed an in-system-test architecture based on the combination of DfT schemes and hybrid structures to detect faults and provide diagnosis features during the in-field operation of system-on-chips (SoCs) and GPUs. However, a massive effort is required to


Investigating and Reducing the Architectural Impact of Transient Faults in Special Function Units for GPUs

Rodriguez Condia

Guerrero-Balaguera

Patiño Núñez

et al. 2024

J Electron Test


Ensuring the reliability of GPUs and their internal components is paramount, especially in safety-critical domains like autonomous machines and self-driving cars. These cutting-edge applications heavily rely on GPUs to implement complex algorithms due to their implicit programming flexibility and parallelism, which is crucial for efficient operation. However, as integration technologies advance, there is a growing concern regarding the potential increase in fault sensitivity of the internal components of current GPU generations. In particular, Special Function Unit (SFU) cores inside GPUs are used in multimedia, High-Performance Computing, and neural network training. Despite their frequent usage and critical role in several domains, reliability evaluations of SFUs and the development of effective mitigation solutions remain unexplored. This work evaluates the impact of transient faults in the main hardware structures of SFUs in GPUs. In addition, we analyze the main overhead costs and benefits of developing selective-hardening mechanisms for SFUs. We focus on evaluating and analyzing two SFU architectures for GPUs ('fused' and 'modular') and their relations to energy, area, and reliability impact on parallel applications. The experiments resort to fine-grain fault injection campaigns on an RTL GPU model (FlexGripPlus) instrumented with both SFUs. The results on both SFU architectures indicate that fused SFUs (in commercial-grade devices) require lower area overhead (about 27%) for their integration in GPUs but are more vulnerable to transient faults (by up to 47% for the analyzed cases) and less power efficient (by up to 36.6%) than modular SFUs. Moreover, the reliability estimation shows that modular SFUs are structurally more resilient than fused ones by up to one order of magnitude. Similarly, a selective-hardening mechanism based on Triple-Modular Redundancy (TMR) shows that coarse-grain strategies might increase the reliability of the overall SFUs under feasible overhead costs.
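To illustrate the TMR idea named in this abstract, the following Python sketch models a bitwise majority voter over three redundant copies of a result word. It is only an assumption-level model of the general technique: the function names `tmr_vote` and `flip_bit` are hypothetical, and nothing here reflects the actual hardening logic or RTL evaluated in the paper.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Return the bitwise majority of three redundant result words.

    Each output bit takes the value held by at least two of the three copies.
    """
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model a transient single-bit upset by inverting one bit of word."""
    return word ^ (1 << bit)


# Hypothetical example: one of three copies suffers a bit flip.
golden = 0x3F800000  # bit pattern of 1.0 in binary32, used only as sample data
corrupted = flip_bit(golden, 5)
voted = tmr_vote(golden, corrupted, golden)
```

The voter masks any fault confined to one copy per bit position; if the same bit is flipped in two copies, the wrong value wins the vote, which is why the voter itself and common-mode faults still need separate consideration.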
