Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability

Rech, Paolo; Pilla, Laércio Lima; Navaux, Philippe O. A.; Carro, Luigi

doi:10.1109/dsn.2014.49

Cited by 70 publications

(56 citation statements)

References 18 publications

“…The ECC seems effective in reducing SDC occurrences. Some SDC still occur even if ECC is enabled as SET on logic resources or scheduler failures are left undetected, and both these resources have been demonstrated to contribute significantly to GPGPUs SDC rate [7], [18].…”

Section: B Error-correcting Code Capabilities and Overhead (mentioning)

confidence: 98%

“…A SDC is typically produced when radiation corrupts memory elements storing variables or data used for computation, or when the logic executing an operation experience a Single Event Transient (SET) [9]. Additionally, on GPGPUs a SDC can occur when the scheduler fails in synchronizing threads, assigning a thread to a proper CUDA core, or presents results that are still incomplete [18]. A FI happens, for instance, when radiation induces a control flow error and prevents the application running on the GPGPU from being completed, when a scheduler failure hangs the GPGPU kernel or when the PCI-Express bus controller is corrupted.…”

Section: B Experimental Setup (mentioning)

confidence: 99%

GPGPUs ECC efficiency and efficacy

Oliveira, Rech, Pilla, et al. 2014

2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)

In this paper we assess and discuss the efficiency and overhead of the Error-Correcting Code (ECC) mechanism available on modern GPGPUs, which are increasingly used for both High Performance Computing and safety-critical applications. Both the resilience to radiation-induced silent data corruption and functional interruption are experimentally and analytically addressed. The provided experimental analysis demonstrates that the ECC significantly reduces the occurrence of silent data corruption but may not be sufficient to guarantee high reliability. Moreover, the ECC increases the GPGPU functional interruption rate. Finally, the ECC performances and reliability are compared to Algorithm-Based Fault Tolerance and Duplication With Comparison strategies.
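The Duplication With Comparison strategy named in the abstract above can be illustrated with a minimal host-side sketch. It is an assumption-laden illustration, not the authors' implementation: the helper name `run_with_dwc` is hypothetical, the kernel is modeled as an ordinary Python callable, and a mismatch between the two runs is simply reported as an exception.

```python
def run_with_dwc(kernel, *args):
    """Run `kernel` twice on the same inputs and compare the outputs.

    Hypothetical helper: a mismatch is treated as a detected error
    (a candidate silent data corruption). Identical outputs are returned.
    """
    first = kernel(*args)
    second = kernel(*args)
    if first != second:
        raise RuntimeError("DWC mismatch: outputs of the two executions differ")
    return first


# Usage: a deterministic kernel passes the comparison unchanged.
result = run_with_dwc(lambda a, b: a + b, 2, 3)
```

Duplication detects a mismatch but cannot tell which copy is wrong, and it roughly doubles execution time, which is the kind of overhead the abstract compares against ECC.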

“…The scheduler on NVIDIA devices is implemented in hardware and has already been demonstrated to contribute to the device radiation sensitivity [34]. Intel Xeon Phi relies on the operating system to manage execution [22] which may be less susceptible to radiation-induced failures.…”

Section: A Dgemm (mentioning)

confidence: 99%

“…It is worth noting that while the K40 thread management seems to increase its sensitivity, it may be more efficient. The K40 may then produce more correct data before experiencing a failure [34].…”

Section: A Dgemm (mentioning)

confidence: 99%

Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

Oliveira, Pilla, Hanzich, et al. 2017

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)

In this paper, we evaluate the error criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as long as imprecise computing is concerned, the simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications' output correlating the number of corrupted elements with their spatial locality. Also, we provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude. We apply the selected metrics to experimental results obtained in various radiation test campaigns for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while Xeon Phi is more reliable when executing particles interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures.
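Two of the metrics mentioned in this abstract, the number of corrupted elements and the dataset-wise mean relative error, can be sketched as follows. The exact definitions used in the paper are not given here, so this is one plausible reading: the function names are hypothetical, outputs are flat lists of floats, and elements whose expected value is zero are skipped to avoid division by zero.

```python
def corrupted_count(expected, observed):
    """Count output elements that differ from the fault-free reference."""
    return sum(1 for e, o in zip(expected, observed) if e != o)


def mean_relative_error(expected, observed):
    """Average |observed - expected| / |expected| over the whole dataset.

    Assumption: elements with a zero expected value are excluded.
    """
    errors = [abs(o - e) / abs(e) for e, o in zip(expected, observed) if e != 0]
    return sum(errors) / len(errors) if errors else 0.0


# Usage: one of three elements is corrupted, by 50% of its expected value.
golden = [1.0, 2.0, 4.0]
faulty = [1.0, 3.0, 4.0]
n_bad = corrupted_count(golden, faulty)
mre = mean_relative_error(golden, faulty)
```

Reporting both values separates how widely an error spreads from how large it is, which is the distinction the abstract draws against simple mismatch detection.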

“…Recently, the research community has started tackling the challenging problem of characterizing the reliability of GPGPU based systems, i.e., their vulnerability to soft and hard errors [1] [2]. This challenging problem requires the development of accurate and fast reliability assessment techniques to deal with the delicate trade-off between analysis time and accuracy of the reported measurements and the ability to provide results that can guide system designers in the choice and development of efficient error resilience mechanisms.…”

Section: Introduction (mentioning)

confidence: 99%