2013 IEEE 19th International On-Line Testing Symposium (IOLTS), 2013
DOI: 10.1109/iolts.2013.6604090

Efficacy and efficiency of algorithm-based fault-tolerance on GPUs

Cited by 17 publications (15 citation statements)
References 14 publications
“…Highly parallel computing architectures, like the Xeon Phi, have some reliability weaknesses [15,16,21,49]. For instance, a single particle generating a radiation-induced failure in the scheduler or shared memories (used to expedite parallel executions), is likely to affect the computation of several parallel threads.…”
Section: Background 21 Transient Errors Effects in HPC (mentioning)
confidence: 99%
“…Single and line are easily corrected in linear time on parallel devices [33], [47] while square and random errors are more difficult to detect and correct. Therefore, applying ABFT, DGEMM would be affected by only 20% to 40% of all errors on K40, and 60% to 80% on Xeon Phi.…”
Section: A DGEMM (mentioning)
confidence: 99%
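
To make the statement above concrete, here is a minimal sketch of checksum-based ABFT for matrix multiplication, assuming the classic row/column checksum scheme. It is not taken from the cited papers; the function names (`abft_matmul`, `locate_and_correct`) and the tolerance parameter are hypothetical, and plain Python lists stand in for a GPU kernel. It illustrates why a single corrupted element or a corrupted line (one row or one column) can be located and repaired with a linear number of checksum comparisons, while scattered errors cannot.

```python
def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def abft_matmul(a, b):
    """Return the full-checksum product: last row and last column hold checksums."""
    a_c = a + [[sum(col) for col in zip(*a)]]  # append column-checksum row to A
    b_r = [row + [sum(row)] for row in b]      # append row-checksum column to B
    return matmul(a_c, b_r)

def locate_and_correct(c_f, tol=1e-9):
    """Check the checksums of c_f in place; repair a single or line error.

    Assumption of this sketch: the checksum row and column themselves are intact.
    """
    m, n = len(c_f) - 1, len(c_f[0]) - 1
    # Difference between recomputed and stored checksum, per row and per column.
    row_diff = [sum(c_f[i][:n]) - c_f[i][n] for i in range(m)]
    col_diff = [sum(c_f[i][j] for i in range(m)) - c_f[m][j] for j in range(n)]
    bad_rows = [i for i, d in enumerate(row_diff) if abs(d) > tol]
    bad_cols = [j for j, d in enumerate(col_diff) if abs(d) > tol]
    if not bad_rows and not bad_cols:
        return "ok"
    if len(bad_rows) == 1 and bad_cols:
        # Single error or row error: each column checksum gives the error value.
        i = bad_rows[0]
        for j in bad_cols:
            c_f[i][j] -= col_diff[j]
        return "corrected"
    if len(bad_cols) == 1 and bad_rows:
        # Column error: each row checksum gives the error value.
        j = bad_cols[0]
        for i in bad_rows:
            c_f[i][j] -= row_diff[i]
        return "corrected"
    return "uncorrectable"  # e.g. square or random error patterns
```

With floating-point data the tolerance would have to account for rounding error in the checksums, which this sketch does not attempt to model.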
“…al. [30] report that ABFT for the generic matrix multiply (GEMM) routine incurs 18%-45% execution time overhead versus the unprotected GEMM on medium to large matrix dimensions under a GPU implementation. In conjunction with recent studies on soft errors in processors that indicate that hardware faults tend to happen in bursts [1], [19], [31], this shows that ABFT techniques may ultimately not be the best way to mitigate arbitrary fault patterns occurring in 32-bit or 64-bit data representations in memory, arithmetic or logic units of the utilized hardware.…”
Section: A Summary of Prior Work (mentioning)
confidence: 99%