2019
DOI: 10.1145/3337929

Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing

Abstract: The pursuit of many research questions requires massive computational resources. State-of-the-art research in physical processes using simulations, the training of neural networks for deep learning, and the analysis of big data all depend on the availability of sufficient and performant computational resources. For such research, access to a high-performance computing infrastructure is indispensable. Many scientific workloads from such research domains are inherently parallel and can benefit from the data-…

Cited by 16 publications (10 citation statements) | References 20 publications
“…Indeed, they tremendously ease the generation of highly customizable modules and enable the birth of System on Chip generators [4], [5]. However, architecture modeling is often case-specific [6], and the support for automatic DSE is still missing. Nowadays, Computer-Aided Design (CAD) tools are vital to increase productivity, ensure correctness and performance of intricate designs, and enable a higher level of complexity.…”
Section: Introduction
Confidence: 99%
“…For that, we select TVM due to its proven functionality, active community, and support for quantized computations [8,9]. We furthermore choose bit-serial forms of computation on ARM processors, which allow for arbitrary reduced-precision representations on various platforms [10]. The bit-serial approach does not scale according to reduced data size and, moreover, it does not seem to be bound by cache bandwidth, at least not with regard to the cache-bound model discussed in this work.…”
Section: Introduction
Confidence: 99%
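The bit-serial computation the statement above refers to can be illustrated with a small sketch (the function name and NumPy formulation are mine, not taken from the cited work): an integer matrix product is decomposed into bit-planes, so that only binary matrix products are computed and the results are accumulated with power-of-two weights.

```python
import numpy as np

def bitserial_matmul(A, B, bits_a=4, bits_b=4):
    """Bit-serial matrix multiply for unsigned integer matrices.

    Decomposes A and B into 0/1 bit-plane matrices and accumulates
    binary matrix products weighted by powers of two:
        C = sum over i, j of 2**(i + j) * (A_i @ B_j)
    where A_i holds the i-th bit of every element of A.
    """
    C = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for i in range(bits_a):
        A_i = (A >> i) & 1          # i-th bit-plane of A
        for j in range(bits_b):
            B_j = (B >> j) & 1      # j-th bit-plane of B
            C += (A_i @ B_j) << (i + j)
    return C
```

Because precision enters only through the loop bounds, the same kernel serves any reduced-precision representation, which is the flexibility the statement highlights.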
“…Architectures that are more recent are reported in [30][31][32][33]. In [30] a generic system architecture is proposed for binary string comparisons that is based on a Virtex UltraScale+ FPGA.…”
Confidence: 99%
“…A LUT-efficient compressor architecture for performing population count operation is described in [31] to be used in matrix multiplications of variable precision. The authors started with a population count unit built as a tree of 6:3 LUTs and adders requiring a large number of LUTs and many stages to pipeline the adder tree.…”
Confidence: 99%