2011 IEEE Ninth International Symposium on Parallel and Distributed Processing with Applications
DOI: 10.1109/ispa.2011.50
Matrix Multiplication on GPUs with On-Line Fault Tolerance

Cited by 48 publications (22 citation statements)
References 20 publications
“…Chen [6] analyzes the block row data partitioning scheme for sparse matrices and derives a sufficient condition for recovering critical data without checkpointing. Ding et al. [25] construct a column/row checksum matrix for matrix multiplication for GPUs. During computation, the partial product matrix is scanned so that soft errors can be detected and corrected at runtime.…”
Section: B. Software Solutions to Soft Errors
Confidence: 99%
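The checksum scheme the excerpt describes can be illustrated in a few lines. The sketch below is a minimal NumPy rendering of the classic Huang-Abraham-style ABFT idea (not the authors' CUDA implementation): append a column-checksum row to A and a row-checksum column to B, so the product carries redundant sums that localize and correct a single corrupted element of C.

```python
import numpy as np

def abft_matmul(A, B):
    """Full-checksum product: a column-checksum A times a row-checksum B
    yields C = A @ B bordered by its own row and column sums."""
    Ac = np.vstack([A, A.sum(axis=0)])                  # extra checksum row
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # extra checksum column
    return Ac @ Br                                      # (m+1) x (p+1)

def detect_and_correct(Cf, tol=1e-6):
    """Scan the checksums; if exactly one element of C disagrees with
    both its row and column checksum, correct it in place."""
    C = Cf[:-1, :-1]
    row_err = Cf[:-1, -1] - C.sum(axis=1)   # per-row checksum mismatch
    col_err = Cf[-1, :-1] - C.sum(axis=0)   # per-column checksum mismatch
    bad_rows = np.where(np.abs(row_err) > tol)[0]
    bad_cols = np.where(np.abs(col_err) > tol)[0]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        C[i, j] += row_err[i]                # single-error correction
        return (i, j)
    return None                              # clean, or not single-error

# Inject one soft error into the product and recover it.
A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
Cf = abft_matmul(A, B)
Cf[0, 1] += 5.0                              # simulated bit-flip in C[0,1]
location = detect_and_correct(Cf)
```

The intersection of the single inconsistent row and column pinpoints the faulty element, which is why the scheme detects and corrects errors online without recomputing the product.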
“…The performance results are competitive with those achieved by a high-performance GPU, as presented in Ding et al. [2011], when considering matrices of the same sizes. The power and energy consumption, on the other hand, are much smaller, even when scaling for the different implementation technologies.…”
Section: Maximum Performance Facing a Memory Wall
Confidence: 81%
“…This section presents a comparison of the RA3 architecture against the GPU, executing the very efficient ABFT implementation presented in Ding et al. [2011]. A GPU is a highly parallel architecture divided into small computing units, named streaming processors, each instantiating a large number of parallel threads.…”
Section: Comparison of the Execution Time with the GPU
Confidence: 99%
“…Soft error in the GPU has been exploited [18], and methods have been developed to detect [36,40] and recover from error [35,25,24]. Recently, soft error in matrix multiplication on a GPU has also been studied [10].…”
Section: Related Work
Confidence: 99%