2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks 2014
DOI: 10.1109/dsn.2014.49
|View full text |Cite
|
Sign up to set email alerts
|

Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability

Abstract: Graphics Processing Units (GPUs) offer high computational power but require high scheduling strain to manage parallel processes, which increases the GPU cross section. The results of extensive neutron radiation experiments performed on NVIDIA GPUs confirm this hypothesis. Reducing the application Degree Of Parallelism (DOP) reduces the scheduling strain but also modifies the GPU parallelism management, including memory latency, thread registers number, and the processors occupancy, which influence the sensitiv… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
1
1
1

Citation Types

0
56
0

Year Published

2014
2014
2024
2024

Publication Types

Select...
3
3
2

Relationship

0
8

Authors

Journals

citations
Cited by 70 publications
(56 citation statements)
references
References 18 publications
0
56
0
Order By: Relevance
“…The ECC seems effective in reducing SDC occurrences. Some SDC still occur even if ECC is enabled as SET on logic resources or scheduler failures are left undetected, and both these resources have been demonstrated to contribute significantly to GPGPUs SDC rate [7], [18].…”
Section: B Error-correcting Code Capabilities and Overheadmentioning
confidence: 98%
See 1 more Smart Citation
“…The ECC seems effective in reducing SDC occurrences. Some SDC still occur even if ECC is enabled as SET on logic resources or scheduler failures are left undetected, and both these resources have been demonstrated to contribute significantly to GPGPUs SDC rate [7], [18].…”
Section: B Error-correcting Code Capabilities and Overheadmentioning
confidence: 98%
“…A SDC is typically produced when radiation corrupts memory elements storing variables or data used for computation, or when the logic executing an operation experience a Single Event Transient (SET) [9]. Additionally, on GPGPUs a SDC can occur when the scheduler fails in synchronizing threads, assigning a thread to a proper CUDA core, or presents results that are still incomplete [18]. A FI happens, for instance, when radiation induces a control flow error and prevents the application running on the GPGPU from being completed, when a scheduler failure hangs the GPGPU kernel or when the PCI-Express bus controller is corrupted.…”
Section: B Experimental Setupmentioning
confidence: 99%
“…The scheduler on NVIDIA devices is implemented in hardware and has already been demonstrated to contribute to the device radiation sensitivity [34]. Intel Xeon Phi relies on the operating system to manage execution [22] which may be less susceptible to radiation-induced failures.…”
Section: A Dgemmmentioning
confidence: 99%
“…It is worth noting that while the K40 thread management seems to increase its sensitivity, it may be more efficient. The K40 may then produce more correct data before experiencing a failure [34].…”
Section: A Dgemmmentioning
confidence: 99%
“…Recently, the research community has started tackling the challenging problem of characterizing the reliability of GPGPU based systems, i.e., their vulnerability to soft and hard errors [1] [2]. This challenging problem requires the development of accurate and fast reliability assessment techniques to deal with the delicate trade-off between analysis time and accuracy of the reported measurements and the ability to provide results that can guide system designers in the choice and development of efficient error resilience mechanisms.…”
Section: Introductionmentioning
confidence: 99%