2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
DOI: 10.1109/fccm48280.2020.00013
Evaluating Low-Memory GEMMs for Convolutional Neural Network Inference on FPGAs

Cited by 6 publications (6 citation statements) · References 19 publications
“…Modern processors accommodate general matrix multiplication (GEMM) libraries to accelerate basic linear algebra calculations. Zhang et al. first introduced this idea to FPGA-based CNN design by using the "im2col" method to transform the calculation of CLs into matrix multiplication through reshaping and packing [15]. However, the "im2col" + GEMM approach requires significant on-chip memory to store the transformed image matrix, making it impractical for mid- to low-end FPGA platforms.…”
Section: Related Work (mentioning; confidence: 99%)
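
To illustrate the memory issue raised above, here is a minimal NumPy sketch of the "im2col" + GEMM lowering. Stride 1 and no padding are assumed for clarity, and the function names are ours rather than those of [15]; the key point is the (C*K*K) x (H_out*W_out) patch matrix that must be materialized before the single GEMM can run.

import numpy as np

def im2col(x, K):
    # Unfold a C x H x W input into a (C*K*K) x (H_out*W_out) patch matrix.
    C, H, W = x.shape
    H_out, W_out = H - K + 1, W - K + 1
    cols = np.empty((C * K * K, H_out * W_out), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            cols[:, i * W_out + j] = x[:, i:i + K, j:j + K].reshape(-1)
    return cols

def conv_im2col(x, w):
    # Convolution lowered to a single GEMM: (M x C*K*K) @ (C*K*K x H_out*W_out).
    M, C, K, _ = w.shape
    _, H, W = x.shape
    cols = im2col(x, K)              # the large intermediate image matrix
    out = w.reshape(M, -1) @ cols    # one GEMM produces every output pixel
    return out.reshape(M, H - K + 1, W - K + 1)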
“…[13,28,29] focus on frequency-domain CNN acceleration, which is usually advantageous for large kernels. [30] used kn2row and achieved state-of-the-art speedup on Inception-v4, which has more memory-bounded layers. While these works adopt certain strategies to accelerate CNN inference, they do not explore the flexible switching of different algorithms within the same network.…”
Section: Motivation (mentioning; confidence: 99%)
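
For contrast with im2col, the following is a minimal sketch of the kn2row idea referenced for [30]. Stride 1 and "same" zero padding are assumptions here, and the code is an illustration rather than the cited implementation: each kernel tap (k1, k2) becomes a 1x1-convolution GEMM of shape (M x C) @ (C x H*W), and the partial results are shifted and accumulated, so no (C*K*K) x (H*W) image matrix is ever materialized.

import numpy as np

def conv_kn2row(x, w):
    # kn2row-style convolution: K*K small GEMMs followed by shift-and-add.
    M, C, K, _ = w.shape
    _, H, W = x.shape
    pad = K // 2
    x_flat = x.reshape(C, H * W)             # reused for every kernel tap
    out = np.zeros((M, H, W), dtype=x.dtype)
    for k1 in range(K):
        for k2 in range(K):
            # 1x1-conv GEMM for this tap: (M x C) @ (C x H*W)
            partial = (w[:, :, k1, k2] @ x_flat).reshape(M, H, W)
            dr, dc = k1 - pad, k2 - pad      # spatial offset of this tap
            src_r = slice(max(0, dr), H + min(0, dr))
            src_c = slice(max(0, dc), W + min(0, dc))
            dst_r = slice(max(0, -dr), H + min(0, -dr))
            dst_c = slice(max(0, -dc), W + min(0, -dc))
            out[:, dst_r, dst_c] += partial[:, src_r, src_c]
    return out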
“…Then the Pad-and-Accumulate module shifts each intermediate output patch to align with the position determined by the original K1 × K2 kernel and produces the final output feature maps using an accumulation buffer. The bank indices and address offsets for Pad-and-Accumulate are precomputed based on CONV layer metadata [30]. The "unit-CONV GEMM" and "Pad-and-Accumulate" phases are pipelined, enabling the CU to start working on the next patch of the unit-CONV GEMM while the accumulation buffer is still processing the previous patch.…”
Section: Architecture Overview (mentioning; confidence: 99%)
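
A small sketch of what such a precomputation might look like; the banking scheme below is purely hypothetical, since the quoted text only states that bank indices and address offsets are derived from the CONV layer metadata. For each tap of a K1 x K2 kernel, the spatial shift translates into a fixed linear offset into a row-major accumulation buffer, which can be split once into a bank index and a word offset before the pipeline starts.

def precompute_pad_accumulate_offsets(K1, K2, W_out, num_banks):
    # For each tap (k1, k2), precompute the linear shift into a row-major
    # output buffer and a (bank, word) split; offsets are relative to the
    # output pixel currently being accumulated. Hypothetical scheme.
    pad_r, pad_c = K1 // 2, K2 // 2
    table = {}
    for k1 in range(K1):
        for k2 in range(K2):
            dr, dc = k1 - pad_r, k2 - pad_c
            lin = dr * W_out + dc                      # signed linear shift
            table[(k1, k2)] = (lin % num_banks, lin // num_banks)
    return table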