Gradient descent, conjugate gradient, and other iterative algorithms are a powerful class of algorithms; however, they can take a long time to converge. Existing accelerator designs cover an insufficient range of operations and do not perform well on the problems we target. In this thesis we present a novel hardware architecture for accelerating gradient descent and similar algorithms. To support this architecture, we also present a sparse matrix-vector storage format, along with software support for using it, so that sparse problems can be mapped efficiently onto hardware that is also well suited to dense operations. We show that the accelerator design outperforms similar designs that target only the most dominant operation of a given algorithm, providing substantial energy and performance benefits. We further show that the accelerator can reasonably be implemented on a general-purpose CPU with small area overhead.