Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays
DOI: 10.1145/1046192.1046203

Floating-point sparse matrix-vector multiply for FPGAs

Abstract: We also analyze the asymptotic efficiency of our architecture as parallelism scales using a constant-Rent-parameter matrix model. This demonstrates that our data placement techniques provide an asymptotic scaling benefit. While FPGA performance is attractive, higher performance is possible if we re-balance the hardware resources in FPGAs with embedded memories. We show that sacrificing half the logic area for memory area rarely degrades performance and improves performance for large matrices by up to 5 times.
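The kernel in question is sparse matrix-vector multiply (SMVM), y = A·x with A stored in a compressed sparse format. As a point of reference only (the paper's actual architecture and data placement scheme are hardware-specific and not reproduced here), the sketch below shows a CSR-format SMVM with rows statically partitioned across processing elements; the csr_matrix type and smvm_pe function are illustrative names, not taken from the paper.

#include <stddef.h>

typedef struct {
    size_t        n_rows;
    const size_t *row_ptr;   /* length n_rows + 1; row i spans [row_ptr[i], row_ptr[i+1]) */
    const size_t *col_idx;   /* column index of each stored nonzero */
    const double *val;       /* nonzero values */
} csr_matrix;

/* y = A * x, with PE `pe` of `n_pes` owning one contiguous block of rows. */
void smvm_pe(const csr_matrix *A, const double *x, double *y,
             size_t pe, size_t n_pes)
{
    size_t rows_per_pe = (A->n_rows + n_pes - 1) / n_pes;   /* ceiling division */
    size_t r0 = pe * rows_per_pe;
    size_t r1 = r0 + rows_per_pe < A->n_rows ? r0 + rows_per_pe : A->n_rows;

    for (size_t i = r0; i < r1; i++) {
        double acc = 0.0;
        for (size_t j = A->row_ptr[i]; j < A->row_ptr[i + 1]; j++)
            acc += A->val[j] * x[A->col_idx[j]];             /* irregular read of x */
        y[i] = acc;
    }
}

How rows and the corresponding entries of x are assigned to PEs determines how much of x each PE must hold locally and how much must be communicated, which is the data placement question the abstract's scaling analysis addresses.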

Cited by 117 publications (97 citation statements); references 13 publications.
“…Using 32 leaf processing FPGAs (512 PEs), we are able to sustain a per-leaf processing rate of 3 Gflops. More details on our first-generation FPGA-based SMVM implementation are reported in [deLorimier05].…”
Section: Bellman-Ford
confidence: 99%
“…For example, on Sparse Matrix-Vector Multiplication (SMVM), processor-based machines typically achieve only 1-15% of their potential performance [deLorimier05]. While caching, banking, DMA block transfer, and strided prefetch allow these machines to efficiently process dense matrix operations or regular graphs, large data structures coupled with irregular data access defeat these simple optimizations.…”
Section: Introduction
confidence: 99%
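To make the quoted point concrete, here is a hypothetical side-by-side of a dense row walk and a CSR row walk (function names are illustrative): the dense loop streams memory at unit stride, which caches and strided prefetchers exploit, while the sparse loop reads x through a column-index array, so consecutive loads are data-dependent and land at irregular addresses.

#include <stddef.h>

/* Dense row: unit-stride, predictable access that caching and prefetch handle well. */
double dense_row_dot(const double *row, const double *x, size_t n)
{
    double acc = 0.0;
    for (size_t j = 0; j < n; j++)
        acc += row[j] * x[j];
    return acc;
}

/* CSR row: x is read through col_idx, so consecutive loads can land
 * anywhere in x -- the irregular access pattern the quote refers to. */
double sparse_row_dot(const double *val, const size_t *col_idx,
                      size_t nnz, const double *x)
{
    double acc = 0.0;
    for (size_t j = 0; j < nnz; j++)
        acc += val[j] * x[col_idx[j]];
    return acc;
}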
“…This led researchers to begin by focusing on kernel operations that are used in HPC and can be provided through a standard library interface. Operations from BLAS [Underwood and Hemmert 2004; Zhuo and Prasanna 2004; Dou et al. 2005; Zhuo and Prasanna 2005a; Zhuo and Prasanna 2005b] to FFTs [Hemmert and Underwood 2005] to the sparse matrix operations at the core of an iterative solver [deLorimier and DeHon 2005; Zhuo and Prasanna 2005c] and even a full CG solver [Morris et al. 2006] have been studied. The fundamental challenge for each of these efforts is the communication with the host.…”
Section: Introduction
confidence: 99%
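For context on why SMVM sits "at the core of an iterative solver", here is a minimal sketch of a standard unpreconditioned conjugate-gradient loop (assumed textbook CG, not any specific cited implementation): each iteration performs one SMVM plus a handful of inner products and vector updates, so SMVM throughput and host-accelerator transfer cost dominate. The smvm routine is left as an external placeholder standing in for the offloaded kernel.

#include <stddef.h>

/* Placeholder for the offloaded kernel: y = A * x for the sparse matrix A. */
extern void smvm(const void *A, const double *x, double *y, size_t n);

static double dot(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Unpreconditioned CG for A*x = b; r, p, Ap are caller-provided work vectors.
 * Each iteration: one SMVM, two inner products, three vector updates. */
void cg(const void *A, const double *b, double *x,
        double *r, double *p, double *Ap,
        size_t n, size_t max_iter, double tol)
{
    smvm(A, x, Ap, n);                                /* r = b - A*x, p = r */
    for (size_t i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }

    double rr = dot(r, r, n);
    for (size_t k = 0; k < max_iter && rr > tol * tol; k++) {
        smvm(A, p, Ap, n);                            /* dominant kernel: SMVM */
        double alpha = rr / dot(p, Ap, n);
        for (size_t i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r, n);
        double beta = rr_new / rr;
        for (size_t i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
}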
“…To explore this acceleration, a number of different hardware architectures have been investigated. These architectures include Connection Machines [11], Cell Processors [12], Graphical Processing Units (GPUs) [13] and FPGAs [14]. A widely implemented comparative benchmark for floating-point computations is the General Matrix Multiply (GEMM) subroutine, part of the Basic Linear Algebra Subprograms (BLAS) library [15].…”
Section: Architectures for Scientific Computation
confidence: 99%
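For readers unfamiliar with the benchmark named above, GEMM computes C = alpha·A·B + beta·C over dense operands. A naive reference version is shown below for clarity only; tuned BLAS libraries block, vectorize, and parallelize this loop nest.

#include <stddef.h>

/* Reference GEMM for row-major n x n operands: C = alpha*A*B + beta*C. */
void gemm_ref(size_t n, double alpha, const double *A, const double *B,
              double beta, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}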
“…Due to the domination of the algorithm by inner-products, known to map well to FPGAs [14][25], CG is well suited, even for small dense systems. The FPGA allows the construction of a data-path specialised not only to the CG algorithm, but to the order of the matrix.…”
Section: Previous FPGA Implementations
confidence: 99%
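A loose software analogue of specialising the data-path to the order of the matrix (purely illustrative, not the cited design): when the system size is a compile-time constant, the inner products can be fully unrolled into a fixed pipeline of multiply-adds.

#include <stddef.h>

#define ORDER 4   /* hypothetical fixed system size known at compile/synthesis time */

/* Fixed-trip-count inner product: fully unrollable into a chain of
 * multiply-adds, the software analogue of an FPGA data-path sized to ORDER. */
static double dot_fixed(const double a[ORDER], const double b[ORDER])
{
    double s = 0.0;
    for (int i = 0; i < ORDER; i++)
        s += a[i] * b[i];
    return s;
}

/* Dense matrix-vector product of fixed order: one dot_fixed per row. */
static void matvec_fixed(const double A[ORDER][ORDER],
                         const double x[ORDER], double y[ORDER])
{
    for (int i = 0; i < ORDER; i++)
        y[i] = dot_fixed(A[i], x);
}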