2013
DOI: 10.1016/j.jocs.2013.01.004

Soft error resilient QR factorization for hybrid system with GPGPU

Abstract: As the general purpose graphics processing units (GPGPU) are increasingly deployed for scientific computing for its raw performance advantages compared to CPUs, the fault tolerance issue has started to become more of a concern than before when they were exclusively used for graphics applications. The pairing of GPUs with CPUs to form a hybrid computing systems for better flexibility and performance creates a massive amounts of computations that have a higher possibility to be affected by transient error -a sof…

Cited by 15 publications (6 citation statements)
References 31 publications
“…This improvement in device specificity would result in significantly fewer false alarms and therefore reduce AF. At the same time, the challenges of "unpredictable code" and "interrupted or corrupt data" have been noted and may represent an important safety issue due to the potential for missing data or data misinterpretation, especially when using memory-intensive applications on devices that are continually operating for prolonged periods of time [89, 92-95].…”
Section: Technological Advances In Patient Monitors (mentioning)
confidence: 99%
“…Existing techniques that can ensure reliability to SDCs comprise two categories: (i) algorithm-based fault tolerance (ABFT), i.e., methods using checksums specifically tailored to the algorithm under consideration, that can reliably detect (and possibly correct) up to a limited number of SDCs [13], [17], [19], [25], [39], [46], [47], [60]; (ii) systems with dual modular redundancy (DMR), where all non-coinciding SDCs can be detected if the same operation is duplicated in two separate processors (or threads) that cross-validate their results [21], but SDCs cannot be corrected without using triple modular redundancy (TMR) [23].…”
Section: A Summary Of Prior Work (mentioning)
confidence: 99%
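
The DMR/TMR distinction drawn in the statement above can be illustrated with a minimal Python sketch. The function names `dmr_check` and `tmr_vote` are hypothetical and do not come from the cited papers; each replica result is assumed to be a hashable value such as an integer or a tuple.

```python
from collections import Counter


def dmr_check(run_a, run_b):
    """Dual modular redundancy: compare two replicas of the same operation.

    Returns (result, ok). A mismatch is detected but cannot be corrected,
    because there is no way to tell which replica is wrong. Two replicas
    corrupted in exactly the same way (coinciding errors) go unnoticed.
    """
    return run_a, run_a == run_b


def tmr_vote(run_a, run_b, run_c):
    """Triple modular redundancy: a majority vote corrects one bad replica."""
    value, count = Counter([run_a, run_b, run_c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three replicas disagree")
    return value


# One corrupted replica: DMR only flags it, TMR recovers the value.
print(dmr_check(42, 43))     # (42, False)
print(tmr_vote(42, 43, 42))  # 42
```

The sketch only shows the voting logic; running the replicas on separate processors or threads, as the statement describes, is left out.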
“…The "best-case" SDC distribution for this implementation would be having x SDCs spread in such a way that only one SDC occurs in a row or column of a partition. Since mABFT requires one division operation for error location and one subtraction operation for error correction [19], [54], 2x operations would be required in order to correct the x detected SDCs.…”
Section: Proof Of Proposition (mentioning)
confidence: 99%
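
The "one division to locate, one subtraction to correct" cost quoted above can be seen in a generic weighted-checksum scheme. The following Python sketch is an assumption-laden illustration, not the mABFT method of the cited work: it protects a single integer vector with a plain checksum and a weighted checksum (weights 1, 2, 3, ...), assumes at most one corrupted entry, and uses the hypothetical names `encode` and `locate_and_correct`.

```python
def encode(x):
    """Return (plain checksum, weighted checksum) of vector x.

    Weights are 1-based positions: 1, 2, 3, ...
    """
    plain = sum(x)
    weighted = sum((i + 1) * v for i, v in enumerate(x))
    return plain, weighted


def locate_and_correct(x, plain, weighted):
    """Repair a single corrupted entry of x in place.

    Returns the index that was repaired, or None if the checksums match.
    Assumes integer data and at most one error in x (checksums intact).
    """
    new_plain, new_weighted = encode(x)
    d_plain = new_plain - plain           # equals the error magnitude
    if d_plain == 0:
        return None
    d_weighted = new_weighted - weighted  # equals (index + 1) * error
    index = d_weighted // d_plain - 1     # one division locates the error
    x[index] -= d_plain                   # one subtraction corrects it
    return index


x = [3, 1, 4, 1, 5]
plain, weighted = encode(x)
x[2] += 7                                 # inject a single error
print(locate_and_correct(x, plain, weighted), x)  # 2 [3, 1, 4, 1, 5]
```

With floating-point data the equality test and the integer division would need a tolerance and rounding, which this sketch omits.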
“…Since we are dealing with sparse matrices, we expect n′ to be very small, and hence the computation of the norm to be accurate. Moreover, since the right hand side in (14) does not depend on x, it can be computed just once for a given matrix and weight vector. Clearly, using (14) as tolerance parameter guarantees no false positive (a computation without any error that is considered as faulty), but allows false negatives (an iteration during which an error occurs without being detected) when the perturbations of the result are small.…”
Section: Experiments 6.1 Setup (mentioning)
confidence: 99%
“…The pioneering paper of Huang and Abraham [10] describes an algorithm capable of detecting and correcting a single silent error striking a dense matrix-matrix multiplication by means of row and column checksums. ABFT protection has been successfully applied to dense LU [11], LU with partial pivoting [12], Cholesky [13] and QR [14] factorizations, and more recently to sparse kernels like SpMxV (matrix-vector product) and triangular solve [15]. The overhead induced by ABFT is usually small, which makes it a good candidate for error detection at each iteration of the CG algorithm.…”
Section: Introduction (mentioning)
confidence: 99%
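
The row and column checksum idea attributed to Huang and Abraham in the statement above can be sketched in plain Python. This is a simplified illustration under stated assumptions: integer matrices (so exact equality can be used), at most one corrupted entry in the data part of the product, and hypothetical helper names (`matmul`, `column_checksum`, `row_checksum`, `check_and_correct`).

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append a column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def check_and_correct(c_full):
    """Verify a full-checksum product and repair a single bad data entry.

    Returns None if consistent, else the (row, col) that was repaired.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row discrepancy equals the error added to entry (i, j).
        c_full[i][j] -= sum(c_full[i][:m]) - c_full[i][m]
        return (i, j)
    raise ValueError("more than one error, or an error in a checksum entry")


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(column_checksum(a), row_checksum(b))  # product with checksums
c[1][0] += 9                                     # inject a silent error
print(check_and_correct(c))                      # (1, 0)
print([row[:2] for row in c[:2]])                # [[19, 22], [43, 50]]
```

The corrupted entry sits at the intersection of the one inconsistent row and the one inconsistent column, which is what makes a single error both detectable and correctable.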