2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines (2012)
DOI: 10.1109/fccm.2012.12

Towards a Universal FPGA Matrix-Vector Multiplication Architecture

Abstract: We present the design and implementation of a universal, single-bitstream library for accelerating matrix-vector multiplication using FPGAs. Our library handles multiple matrix encodings, ranging from dense to multiple sparse formats. A key novelty in our approach is the introduction of a hardware-optimized sparse matrix representation called Compressed Variable-Length Bit Vector (CVBV), which reduces storage and bandwidth requirements by up to 43% (25% on average) compared to compressed sparse row (CSR) across…
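For readers unfamiliar with the CSR baseline that the abstract compares against, here is a minimal Python sketch of CSR storage and a CSR matrix-vector product. It does not implement CVBV, whose encoding is not described in the excerpt above. The function names `dense_to_csr` and `csr_matvec` are illustrative assumptions, not names from the paper's library.

```python
def dense_to_csr(matrix):
    """Convert a dense list-of-lists matrix to CSR arrays.

    Returns (values, col_idx, row_ptr). Names are illustrative only.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)   # nonzero value
                col_idx.append(j)  # its column index (one full word per nonzero)
        row_ptr.append(len(values))  # where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        # Nonzeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


A = [[1, 0, 2],
     [0, 0, 3],
     [4, 5, 0]]
values, col_idx, row_ptr = dense_to_csr(A)
y = csr_matvec(values, col_idx, row_ptr, [1, 2, 3])
```

In this layout every nonzero carries a full column index, which is the per-element index cost that more compact sparse encodings aim to reduce.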

Cited by 64 publications (35 citation statements)
References 20 publications
“…7, the proposed accelerator can obtain higher performance for most of the test matrices, compared with the implementations on the Convey HC2ex platform with four Virtex-6 LX760 FPGAs [13], HC-1 [12] and Tesla S1070 [7]. With the number of the nonzero block in one block row and the density of one increasing, the performance improvement can be higher.…”
Section: Performance Comparison
mentioning
confidence: 96%
“…However, the overhead of the word-level-encoded index data of each nonzero element limits the performance improvement. As the works in [7,8,9], the overhead can be reduced by replacing the indices with bitmap, and the indices are retrieved through the decoding before the computing. However, the performance of these works is restricted by the idle cycles in the index decoding and the zero fillings in the bitmap.…”
Section: Related Work
mentioning
confidence: 99%
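The trade-off described in the statement above (bitmap in place of per-element indices, with a decoding step before the computation) can be sketched roughly as follows. This is a generic illustration in Python, not the scheme used in the cited works [7,8,9] and not CVBV; `bitmap_encode` and `bitmap_dot` are hypothetical helper names.

```python
def bitmap_encode(row):
    """Encode one matrix row as (bitmap, packed nonzero values).

    Bit j of the bitmap is 1 when row[j] is nonzero, so no explicit
    column indices are stored. Hypothetical helper for illustration.
    """
    bits = 0
    values = []
    for j, v in enumerate(row):
        if v != 0:
            bits |= 1 << j
            values.append(v)
    return bits, values


def bitmap_dot(bits, values, x):
    """Dot product of a bitmap-encoded row with a dense vector x.

    Column indices are recovered by scanning the bitmap; iterations that
    hit a 0 bit do no arithmetic, loosely analogous to idle decode cycles.
    """
    total = 0.0
    j = 0  # current column position in the bitmap
    k = 0  # position in the packed values array
    while bits:
        if bits & 1:
            total += values[k] * x[j]
            k += 1
        bits >>= 1
        j += 1
    return total


bits, packed = bitmap_encode([0, 3, 0, 0, 5])
result = bitmap_dot(bits, packed, [1, 2, 3, 4, 5])
```

The bitmap costs one bit per column position rather than one index word per nonzero, so it tends to pay off on denser rows and wastes bits on very sparse ones.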
“…More recently, the focus has shifted to efficient use of on-chip memory resources and DRAM bandwidth utilisation [5], [7], [9]. Recently, compression techniques have been proposed to improve the performance on memory bound matrices [8], [17]. The constant sparsity structure in the context of iterative methods has also been exploited to optimise FPGA architectures for SpMV [18]. Static one-off pre-processing techniques are cost-effective for FPGA implementations if they can lead either to a simplified architecture [5], [7], [19] or reduced communication overhead [8], [17].…”
mentioning
confidence: 99%
“…Recently, compression techniques have been proposed to improve the performance on memory bound matrices [8], [17]. The constant sparsity structure in the context of iterative methods has also been exploited to optimise FPGA architectures for SpMV [18]. Static one-off pre-processing techniques are cost-effective for FPGA implementations if they can lead either to a simplified architecture [5], [7], [19] or reduced communication overhead [8], [17]. Linear or log-linear preprocessing techniques with good performance in practice, such as the method used in this work for extracting matrix properties, have been found to be effective.…”
mentioning
confidence: 99%