High-Performance Designs for Linear Algebra Operations on Reconfigurable Hardware

Zhuo, Ling; Prasanna, Viktor K.

doi:10.1109/tc.2008.55

Cited by 81 publications

(46 citation statements)

References 26 publications

“…[21] proposes a new architecture of bicubic interpolation and implemented it on FPGA. [22] proposes a linear algebra implementations on FPGA, the authors utilizes the overlap technique between I/O and execution time to increase computing speed. [23] shows the procedure of mapping a Jacobi Iterative Solver on FPGA.…”

Section: Related Work (mentioning)

confidence: 99%

The Implementation of Texture-Based Video Up-Scaling on Coarse-Grained Reconfigurable Architecture

Shi, Yin, Liu, et al. 2015

IEICE Trans. Inf. & Syst.

SUMMARY: Video up-scaling is a hot topic in the TV display area; as an important branch of video up-scaling, the Texture-Based Video Up-Scaling (TBVU) method shows great potential for hardware implementation. Coarse-Grained Reconfigurable Architecture (CGRA) is a very promising processor; it is a parallel computing platform that provides the high performance of hardware, the high flexibility of software, and dynamic reconfiguration ability. In this paper we propose an implementation of TBVU on CGRA. We fully exploit the characteristics of TBVU and use several techniques to reduce memory I/O operations and total execution time. Experimental results show that our work greatly reduces I/O operations and execution time compared with the non-optimized implementations. We also compare our work with other platforms and find a great advantage in execution time and resource utilization rate.

“…Although the PE connection pattern in the form of a tree is also possible [13], the linear list has the advantage of a much more regular structure, which allows simpler routing between PEs and consequently the higher clock frequency. After the initial latency, a list of n PEs multiply two n-element vectors in one clock cycle, or two square matrices of order n in n …”

Section: Accelerator Architecture (mentioning)

confidence: 99%

FPGA accelerator for floating-point matrix multiplication

Jovanovic, Milutinović 2012

IET Comput. Digit. Tech.

Abstract: This study treats the architecture and implementation of an FPGA accelerator for double-precision floating-point matrix multiplication. The architecture is oriented towards minimising resource utilisation and maximising clock frequency. It employs the block matrix multiplication algorithm, which returns the result blocks to the host processor as soon as they are computed. This avoids output buffering, and simplifies placement and routing on the chip. The authors show that such an architecture is especially well suited for full-duplex communication links between the accelerator and the host processor. The architecture requires the result blocks to be accumulated by the host processor; however, the authors show that typically more than 99% of all arithmetic operations are performed by the accelerator. The implementation focuses on efficient use of embedded FPGA resources, in order to allow for a large number of processing elements (PEs). Each PE uses 8 Virtex-6 DSP blocks. Both adders and multipliers are deeply pipelined and use several FPGA-specific techniques to achieve small area size and high clock frequency. Finally, the authors quantify the performance of the accelerator implemented in a Xilinx Virtex-6 FPGA, with 252 PEs running at 403 MHz (achieving 203.1 GFLOPS), by comparing it to the DGEMM function from the MKL, ACML, GotoBLAS and ATLAS libraries executing on Intel Core2Quad and AMD Phenom X4 microprocessors running at 2.8 GHz. The accelerator performs 4.5 times faster than the fastest processor/library pair.
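
The two quantitative claims in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (`peak_gflops`, `accelerator_block_product`, `host_blocked_matmul`), the block size parameter `bs`, and the way operations are counted are all hypothetical. It assumes each PE completes one multiply and one add per clock cycle, which is consistent with 252 PEs at 403 MHz giving about 203.1 GFLOPS, and it assumes the accelerator returns one partial product per pair of input blocks while the host only adds those partial products into the result.

```python
def peak_gflops(num_pes, clock_mhz, flops_per_pe_per_cycle=2):
    """Peak throughput, assuming one multiply and one add per PE per cycle."""
    return num_pes * flops_per_pe_per_cycle * clock_mhz / 1000.0


def accelerator_block_product(a_blk, b_blk):
    """Stand-in for the accelerator: plain product of two small blocks."""
    inner, cols = len(b_blk), len(b_blk[0])
    return [
        [sum(a_row[k] * b_blk[k][j] for k in range(inner)) for j in range(cols)]
        for a_row in a_blk
    ]


def host_blocked_matmul(a, b, bs):
    """Blocked product of square matrices a and b (lists of lists).

    Each partial block is handed back as soon as it is computed, and the
    host accumulates it into c. Returns (c, accel_flops, host_flops).
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    accel_flops = 0
    host_flops = 0
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, n, bs):
                a_blk = [row[k0:k0 + bs] for row in a[i0:i0 + bs]]
                b_blk = [row[j0:j0 + bs] for row in b[k0:k0 + bs]]
                partial = accelerator_block_product(a_blk, b_blk)
                rows, inner, cols = len(a_blk), len(b_blk), len(b_blk[0])
                # inner multiplies and (inner - 1) adds per output element
                accel_flops += rows * cols * (2 * inner - 1)
                for i, p_row in enumerate(partial):
                    for j, value in enumerate(p_row):
                        c[i0 + i][j0 + j] += value  # host-side accumulation
                        host_flops += 1
    return c, accel_flops, host_flops


if __name__ == "__main__":
    print(round(peak_gflops(252, 403), 1))
    n, bs = 128, 64
    a = [[(i + j) % 7 for j in range(n)] for i in range(n)]
    b = [[(i * j) % 5 for j in range(n)] for i in range(n)]
    _, accel, host = host_blocked_matmul(a, b, bs)
    print(accel / (accel + host))
```

Under this counting, the accelerator's share of the arithmetic is (2·bs − 1) / (2·bs), so it exceeds 99% once the block size reaches 50. The actual block sizes and partitioning used in the paper are not given in the abstract.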

“…The main focus of that work was to examine the potential capacity of FPGAs in performing BLAS operations. The only work that has implemented linear algebra applications on the reconfigurable computing systems is [21]. However, it only employs the FPGAs in the systems.…”

Section: Linear Algebra on FPGAs (mentioning)

confidence: 99%