Efficacy and efficiency of algorithm-based fault-tolerance on GPUs

Wunderlich, Hans-Joachim; Braun, Claus; Halder, Sebastian

doi:10.1109/iolts.2013.6604090

Cited by 17 publications

(15 citation statements)

References 14 publications


“…Highly parallel computing architectures, like the Xeon Phi, have some reliability weaknesses [15,16,21,49]. For instance, a single particle generating a radiation-induced failure in the scheduler or shared memories (used to expedite parallel executions), is likely to affect the computation of several parallel threads.…”

Section: Background 2.1 Transient Errors Effects in HPC (mentioning)

confidence: 99%

Experimental and analytical study of Xeon Phi reliability

Oliveira

Pilla

DeBardeleben

et al. 2017

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis


We present an in-depth analysis of transient faults effects on HPC applications in Intel Xeon Phi processors based on radiation experiments and high-level fault injection. Besides measuring the realistic error rates of Xeon Phi, we quantify Silent Data Corruption (SDCs) by correlating the distribution of corrupted elements in the output to the application's characteristics. We evaluate the benefits of imprecise computing for reducing the programs' error rate. For example, for HotSpot a 0.5% tolerance in the output value reduces the error rate by 85%. We inject different fault models to analyze the sensitivity of given applications. We show that portions of applications can be graded by different criticalities. For example, faults occurring in the middle of LUD execution, or in the Sort and Tree portions of CLAMR, are more critical than the remaining portions. Mitigation techniques can then be relaxed or hardened based on the criticality of the particular portions.


“…Single and line are easily corrected in linear time on parallel devices [33], [47] while square and random errors are more difficult to detect and correct. Therefore, applying ABFT, DGEMM would be affected by only 20% to 40% of all errors on K40, and 60% to 80% on Xeon Phi.…”

Section: A DGEMM (mentioning)

confidence: 99%

Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

Oliveira

Pilla

Hanzich

et al. 2017

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)


Abstract: In this paper, we evaluate the error criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as long as imprecise computing is concerned, the simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications' output correlating the number of corrupted elements with their spatial locality. Also, we provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude. We apply the selected metrics to experimental results obtained in various radiation test campaigns for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while Xeon Phi is more reliable when executing particles interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures.


“…al. [30] report that ABFT for the generic matrix multiply (GEMM) routine incurs 18%-45% execution time overhead versus the unprotected GEMM on medium to large matrix dimensions under a GPU implementation. In conjunction with recent studies on soft errors in processors that indicate that hardware faults tend to happen in bursts [1], [19], [31], this shows that ABFT techniques may ultimately not be the best way to mitigate arbitrary fault patterns occurring in 32-bit or 64-bit data representations in memory, arithmetic or logic units of the utilized hardware.…”

Section: A Summary Of Prior Work (mentioning)

confidence: 99%

Reliable Linear, Sesquilinear, and Bijective Operations on Integer Data Streams Via Numerical Entanglement

Anam

Andreopoulos

2016

IEEE Trans. Signal Process.


Abstract: A new technique is proposed for fault-tolerant linear, sesquilinear and bijective (LSB) operations on M integer data streams (M ≥ 3), such as: scaling, additions/subtractions, inner or outer vector products, permutations and convolutions. In the proposed method, the M input integer data streams are linearly superimposed to form M numerically-entangled integer data streams that are stored in-place of the original inputs. A series of LSB operations can then be performed directly using these entangled data streams. The results are extracted from the M entangled output streams by additions and arithmetic shifts. Any soft errors affecting any single disentangled output stream are guaranteed to be detectable via a specific post-computation reliability check. In addition, when utilizing a separate processor core for each of the M streams, the proposed approach can recover all outputs after any single fail-stop failure. Importantly, unlike algorithm-based fault tolerance (ABFT) methods, the number of operations required for the entanglement, extraction and validation of the results is linearly related to the number of the inputs and does not depend on the complexity of the performed LSB operations. We have validated our proposal in an Intel processor (Haswell architecture with AVX2 support) via several types of operations: fast Fourier transforms, circular convolutions, and matrix multiplication operations. Our analysis and experiments reveal that the proposed approach incurs between 0.03% to 7% reduction in processing throughput for a wide variety of LSB operations. This overhead is 5 to 1000 times smaller than that of the equivalent ABFT method that uses a checksum stream. Thus, our proposal can be used in fault-generating processor hardware or safety-critical applications, where high reliability is required without the cost of ABFT or modular redundancy.
