2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum
DOI: 10.1109/ipdps.2011.165

An FPGA-Based Accelerator to Speed-Up Matrix Multiplication of Floating Point Operations

Abstract: Field Programmable Gate Arrays (FPGAs) are able to provide a high computational parallelism that can be exploited to achieve high performance improvements in intensive data processing problems. In this paper our efforts were directed towards developing a PC cluster based on nodes that use FPGAs as co-processors. The target application is a floating-point large dense matrix multiplication. Experimental results for just one node of the cluster, consisting of a Xilinx Virtex 5 VLX50T with a PCI interface, showed …

Cited by 9 publications (6 citation statements)
References 6 publications
“…Most state-of-the-art implementations reported achieve peak throughputs ranging between 10 and 20 GFLOPS, although nearly 30 sustained GFLOPS are reported in [10] when targeting a large Virtex-V FPGA (XC5SX240T). Our implementation targets the same FPGA as [12], which permits a comparison based on identical technologies. We obtain as much as 18.1 GFLOPS with 40 PE on a Stratix-III FPGA (EP3S150), and 26.9 GFLOPS with 56 PE on a larger FPGA (EP3SE260), while also offering a good scalability.…”
Section: B. Discussion (mentioning)
confidence: 99%
“…In the field of matrix multiplication, high-performance GPU implementations will deliver as much as 393 GFLOPS [5] on high-end video cards such as Nvidia's GeForce GTX280. The FPGA implementations of matrix multiplication proposed in [6]-[12] can deliver up to 29.8 GFLOPS, but their energy efficiency is better. There is a growing interest in using FPGAs as coprocessing units for hardware acceleration, but the hardware design productivity gap remains because the design and optimization of large circuits is long, complex and reserved to experienced designers.…”
Section: Introduction (mentioning)
confidence: 99%
“…For example, in a wide range of matrix multipliers [1,5,17], the multiply-accumulate operation is performed using a separate FP multiplier and FP adder. Enhancements for matrix multiplications can be obtained if dedicated FP accumulators are used to add the results of FP multipliers [4,18-20].…”
Section: FP MAF on FPGA (mentioning)
confidence: 99%
“…Multiply-add fused (multiplication followed by addition/subtraction), MAF, represents the most common arithmetic operation in applications which require matrix/vector multiplications or convolutions. Such applications include graphics processing, multimedia, image and video processing, DSP, scientific computing and so on [1-5]. Employing a dedicated hardware floating-point (FP) MAF unit for the combined operation has several advantages compared to the solution made of two distinct units (an FP adder and an FP multiplier) [6-8]:…”
Section: Introduction (mentioning)
confidence: 99%