2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum
DOI: 10.1109/ipdps.2011.165

An FPGA-Based Accelerator to Speed-Up Matrix Multiplication of Floating Point Operations

Abstract: Field Programmable Gate Arrays (FPGAs) are able to provide a high computational parallelism that can be exploited to achieve high performance improvements in intensive data processing problems. In this paper our efforts were directed towards developing a PC cluster based on nodes that use FPGAs as co-processors. The target application is a floating-point large dense matrix multiplication. Experimental results for just one node of the cluster, consisting of a Xilinx Virtex 5 VLX50T with a PCI interface, showed …

Cited by 9 publications (6 citation statements)
References 6 publications
“…Most state-of-the-art implementations reported achieve peak throughputs ranging between 10 and 20 GFLOPS, although nearly 30 sustained GFLOPS are reported in [10] when targeting a large Virtex-V FPGA (XC5SX240T). Our implementation targets the same FPGA as [12], which permits a comparison based on identical technologies. We obtain as much as 18.1 GFLOPS with 40 PE on a Stratix-III FPGA (EP3S150), and 26.9 GFLOPS with 56 PE on a larger FPGA (EP3SE260), while also offering a good scalability.…”
Section: B. Discussion (mentioning)
confidence: 99%
“…In the field of matrix multiplication, high-performance GPU implementations will deliver as much as 393 GFLOPS [5] on high-end video cards such as Nvidia's GeForce GTX280. The FPGA implementations of matrix multiplication proposed in [6]-[12] can deliver up to 29.8 GFLOPS, but their energy efficiency is better. There is a growing interest in using FPGAs as coprocessing units for hardware acceleration, but the hardware design productivity gap remains because the design and optimization of large circuits is long, complex and reserved to experienced designers.…”
Section: Introduction (mentioning)
confidence: 99%
“…For example, in a wide range of matrix multipliers [1,5,17], the multiply-accumulate operation is performed using a separate FP multiplier and FP adder. Enhancements for matrix multiplications can be obtained if dedicated FP accumulators are used to add the results of FP multipliers [4,18-20].…”
Section: FP MAF on FPGA (mentioning)
confidence: 99%
“…Multiply-add fused (multiplication followed by addition/subtraction), MAF, represents the most common arithmetic operation in applications which require matrix/vector multiplications or convolutions. Such applications include graphics processing, multimedia, image and video processing, DSP, scientific computing and so on [1-5]. Employing a dedicated hardware floating-point (FP) MAF unit for the combined operation has several advantages compared to the solution made of two distinct units (an FP adder and an FP multiplier) [6-8]:…”
Section: Introduction (mentioning)
confidence: 99%