QR decomposition on GPUs

Kerr, Andrew; Campbell, Daniel P.; Richards, Mark

doi:10.1145/1513895.1513904

Cited by 48 publications

(18 citation statements)

References 5 publications

“…This algorithm has O(n^3) complexity. For more details, we refer the interested reader to the comprehensive overview by Kerr et al. [13]. Note that the block Householder algorithm utilizes the house() function, shown in Algorithm 2.…”

Section: Block Householder QR (mentioning)

confidence: 99%

Transient Fault Resilient QR Factorization on GPUs

Loh

Ramanathan

Saluja

2015

Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale

With their inherent capability to exploit parallelism, GPUs have become a popular platform for data-intensive scientific computing applications. This trend is expected to continue as the number of computations required by scientific applications reaches the petascale and even exascale range. As the minimum feature size of transistors decreases due to improving process technology, GPUs are becoming more vulnerable to transient faults caused by events such as power fluctuations and alpha particle strikes; we therefore need methods that guarantee correct computation even in the presence of such faults. In this paper, we develop and analyze three fault-tolerant schemes, FC-O, PC-C and PC-CS, for the block Householder QR algorithm that can deal with faults in the streaming processor (SP) core of a GPU. We also present a transient fault injection mechanism for NVIDIA GPUs, which is capable of injecting faults of varying durations. We show that two of our schemes, PC-C and PC-CS, have good error coverage and relatively low overhead, and can scale reasonably well in the petascale and exascale range.
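
The citation statement above refers to a house() function and to the O(n^3) cost of Householder QR. As a rough illustration only, here is a minimal pure-Python sketch of an unblocked Householder factorization. It is not the algorithm from Kerr et al. or from the citing paper: the helper name `householder_r`, the list-of-rows matrix layout, and the sign convention are assumptions made for this example, and a block variant would additionally aggregate several reflectors before applying them.

```python
import math


def house(x):
    """Return (v, beta) such that (I - beta * v v^T) x = (alpha, 0, ..., 0).

    Hypothetical helper for illustration; the cited papers' house()
    may use a different normalisation.
    """
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    if norm_x == 0.0:
        return list(x), 0.0  # zero vector: no reflection needed
    # Pick alpha with the opposite sign of x[0] to avoid cancellation.
    alpha = -math.copysign(norm_x, x[0])
    v = list(x)
    v[0] -= alpha
    beta = 2.0 / sum(vi * vi for vi in v)
    return v, beta


def householder_r(a):
    """Unblocked Householder QR: return the factor R of a dense matrix.

    `a` is a list of rows; the input is not modified.
    """
    r = [row[:] for row in a]
    m, n = len(r), len(r[0])
    for k in range(min(m - 1, n)):
        # Reflector that zeroes column k below the diagonal.
        v, beta = house([r[i][k] for i in range(k, m)])
        for j in range(k, n):
            # w = beta * v^T (column j, rows k..m-1)
            w = beta * sum(v[i - k] * r[i][j] for i in range(k, m))
            for i in range(k, m):
                r[i][j] -= w * v[i - k]
    return r
```

The three nested loops over k, j and i are where the cubic operation count comes from. For the 2 x 2 input `[[3.0, 1.0], [4.0, 2.0]]`, the first column has norm 5, so the reflector maps it to (-5, 0) up to rounding.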

“…The details of multi-core, multi-GPU QR factorization scheduling are discussed in [3]. A solution for QR factorization that can be entirely run on the GPU is presented in [7]. For LU factorization on GPUs, a technique to reduce matrix decomposition and row operations to a series of rasterization problems is used [8].…”

Section: Related Work (mentioning)

confidence: 99%

Floating Point Architecture Extensions for Optimized Matrix Factorization

Pedram

Gerstlauer

Geijn

2013

2013 IEEE 21st Symposium on Computer Arithmetic

This paper examines the mapping of algorithms encountered when solving dense linear systems and linear least-squares problems to a custom Linear Algebra Processor. Specifically, the focus is on Cholesky, LU (with partial pivoting), and QR factorizations. As part of the study, we expose the benefits of redesigning floating point units and their surrounding datapaths to support these complicated operations. We show how adding moderate complexity to the architecture greatly alleviates complexities in the algorithm. We study design trade-offs and the effectiveness of architectural modifications to demonstrate that we can improve power and performance efficiency to a level that can otherwise only be expected of full-custom ASIC designs. A feasibility study shows that our extensions to the MAC units can double the speed of required vector-norm operations while reducing energy by 60%. Similarly, up to 20% speedup with 15% savings in energy can be achieved for LU factorization. We show how such efficiency is maintained even in the complex inner kernels of these operations.

“…A solution for QR factorization that can be entirely run on the GPU is presented in [71]. For LU factorization on GPUs, a technique to reduce matrix decomposition and row operations to a series of rasterization problems is used [44].…”

Section: A1 Related Work (mentioning)

confidence: 99%

Algorithm/Architecture Codesign of Low Power and High Performance Linear Algebra Compute Fabrics

Pedram

2013

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum
