Enhancing performance of Tall-Skinny QR factorization using FPGAs

Rafique, Abid; Kapre, Nachiket; Constantinides, George A.

doi:10.1109/fpl.2012.6339142

Cited by 14 publications

(11 citation statements)

References 7 publications


“…If data is kept inside the device or a data reuse scheme can be devised (e.g. [33]), this benefits the FPGA.…”

Section: Discussion (mentioning)

confidence: 99%

“…This can be a disadvantage for memory-bound likelihood computations. Nevertheless, FPGAs enjoy massive on-chip memory bandwidth (20-40 TB/sec [33]) due to large amounts of built-in memory. GPU on-chip memory bandwidths are limited to 8 TB/sec and 1.5 TB/sec [33].…”

Section: Discussion (mentioning)

confidence: 99%

“…Nevertheless, FPGAs enjoy massive on-chip memory bandwidth (20-40 TB/sec [33]) due to large amounts of built-in memory. GPU on-chip memory bandwidths are limited to 8 TB/sec and 1.5 TB/sec [33]. If data is kept inside the device or a data reuse scheme can be devised (e.g.…”

Section: Discussion (mentioning)

confidence: 99%


Population-Based MCMC on Multi-Core CPUs, GPUs and FPGAs

Mingas

Bouganis

2016

IEEE Trans. Comput.


Abstract: Markov Chain Monte Carlo (MCMC) is a method to draw samples from a given probability distribution. Its frequent use for solving probabilistic inference problems, where big-scale data are repeatedly processed, means that MCMC runtimes can be unacceptably large. This paper focuses on population-based MCMC, a popular family of computationally intensive MCMC samplers; we propose novel, highly optimized accelerators in three parallel hardware platforms (multi-core CPUs, GPUs and FPGAs), in order to address the performance limitations of sequential software implementations. For each platform, we jointly exploit the nature of the underlying hardware and the special characteristics of population-based MCMC. We focus particularly on the use of custom arithmetic precision, introducing two novel methods which employ custom precision in the largest part of the algorithm in order to reduce runtime, without causing sampling errors. We apply these methods to all platforms. The FPGA accelerators are up to 114x faster than multi-core CPUs and up to 53x faster than GPUs when doing inference on mixture models.


“…The least squares method (LS) or total least squares (TLS) [14] can be applied to find the frequencies of multiple incident sources. The proposed methods employ the LU factorization and the TLS to estimate the unknown frequency ω_i similar to the ESPRIT algorithm using the following steps:…”

Section: Stage 1: Frequency Estimation (mentioning)

confidence: 99%

Joint Frequency and Time Estimation Algorithms

et al. 2016


In this paper, we present six subspace decomposition based methods for joint time of arrival (TOA) and frequency of arrival (FOA) estimation of multiple incident sources. These are LU-TLS, QR-TLS, direct TSQR-TLS, direct TSLU-TLS, parallel TSQR-TLS, and parallel TSLU-TLS. The direct and parallel TSQR/TSLU-TLS are recently developed methods in subspace decomposition and are employed in this work for time and frequency estimation. The proposed methods employ a pair of spatially separated sensors to receive multiple incident source signals. A data matrix is constructed in the form of a Hankel matrix from multiple snapshots of the received signal. The information of both TOA and FOA of multiple incident sources is extracted from the data matrix by applying LU/QR techniques (in the first set of the methods) and a tall skinny TSLU/TSQR factorization in the second set. The estimates of the TOA and FOA are obtained from the signal subspace by applying the total least squares (TLS) method. Simulation results are presented to assess the performance of the proposed methods. The effect of parametric variations on the performance has also been analyzed for all the proposed methods. Further, the computational times and complexities of the proposed methods are also computed and compared with each other.

Keywords: Time of arrival (TOA) · Frequency of arrival (FOA) estimation · QR and LU decomposition · Direct/parallel tall skinny QR/LU decomposition

Corresponding author: Nizar Tayem


“…Previous FPGA-based implementations have looked at SVD [Brent and Luk (1982)], QRD [Wang and Leeser (2009)] and sparse LUD [Kapre and DeHon (2009)]. However, those approaches all have some limitations in common: either restricted with the scalability of the adapted matrices due to the logic capacity of FPGAs [Brent and Luk (1982); Ahmedsaid et al (2003); Ma et al (2006); Ledesma-Carrillo et al (2011); Wang and Leeser (2009)] or required the input matrices of special property or irregular sparsity structure [Rafique et al (2012); Tai et al (2011); Vachranukunkiet (2007); Kapre and DeHon (2009); Wu et al (2012)].…”

Section: Contributions: FPGA-based Accelerators For Matrix Decomposit (mentioning)

confidence: 99%

Using reconfigurable computing technology to accelerate matrix decomposition and applications

Wang
