2014
DOI: 10.1007/978-3-642-54420-0_66

GPU Behavior on a Large HPC Cluster

Cited by 19 publications (10 citation statements)
References 4 publications
“…Hard errors such as the ones we observed may never be detected in production environments without repeated runs of bitwise reproducible codes, and systems can pass standard acceptance testing while exhibiting considerable error on other calculations. The crashes and unreasonable energies observed in the worst of the failed runs may be written off as a program bug, and more concerningly, the subtle errors in final energy may be accepted as valid results of a code that produces a different answer each time because of rounding differences.…”
Section: Discussion (mentioning)
confidence: 85%
“…Various studies have been conducted for understanding the reliability aspects of using GPU's in large-scale HPC systems. The studies suggest that the newer generation GPU's are more reliable, as are the large-scale HPC systems using them (i.e., the observed MTBF of systems using newer GPU's is much longer than their estimated MTBF) [2][3][4][5][6][7].…”
Section: GPUs for Exascale: DUEs and GPU Reliability (mentioning)
confidence: 99%
“…Unfortunately, GPUs have been shown to suffer from a high rate of Detected Unrecoverable Errors (DUEs) [2][3][4][5][6][7]. The mean time between failures (MTBF) is expected to become much worse as the number of compute nodes increases in the exascale generation.…”
Section: Introduction (mentioning)
confidence: 99%
“…Highly parallel computing architectures, like the Xeon Phi, have some reliability weaknesses [15,16,21,49]. For instance, a single particle generating a radiation-induced failure in the scheduler or shared memories (used to expedite parallel executions), is likely to affect the computation of several parallel threads.…”
Section: Background 2.1 Transient Errors Effects in HPC (mentioning)
confidence: 99%