2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
DOI: 10.1109/fccm48280.2020.00013
Evaluating Low-Memory GEMMs for Convolutional Neural Network Inference on FPGAs

Cited by 6 publications (6 citation statements) · References 19 publications
“…Modern processors accommodate general matrix multiplication (GEMM) libraries to accelerate basic linear algebra calculations. Zhang et al. first introduced this idea to FPGA-based CNN design by using the "im2col" method to transform the calculation of CLs into matrix multiplication through reshaping and packing [15]. However, the "im2col" + GEMM approach requires significant on-chip memory to store the transformed image matrix, making it impractical for mid- to low-end FPGA platforms.…”
Section: Related Work (mentioning; confidence: 99%)
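
To illustrate the memory issue raised above, here is a minimal NumPy sketch of the "im2col" + GEMM lowering. Stride 1 and no padding are assumed for clarity, and the function names are ours rather than those of [15]; the key point is the (C*K*K) x (H_out*W_out) patch matrix that must be materialized before the single GEMM can run.

import numpy as np

def im2col(x, K):
    # Unfold a C x H x W input into a (C*K*K) x (H_out*W_out) patch matrix.
    C, H, W = x.shape
    H_out, W_out = H - K + 1, W - K + 1
    cols = np.empty((C * K * K, H_out * W_out), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            cols[:, i * W_out + j] = x[:, i:i + K, j:j + K].reshape(-1)
    return cols

def conv_im2col(x, w):
    # Convolution lowered to a single GEMM: (M x C*K*K) @ (C*K*K x H_out*W_out).
    M, C, K, _ = w.shape
    _, H, W = x.shape
    cols = im2col(x, K)              # the large intermediate image matrix
    out = w.reshape(M, -1) @ cols    # one GEMM produces every output pixel
    return out.reshape(M, H - K + 1, W - K + 1)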
“…[13,28,29] focus on frequency-domain CNN acceleration, which is usually advantageous for large kernels. [30] used kn2row and achieved state-of-the-art speedup on Inception-v4, which has more memory-bounded layers. While these works adopt certain strategies to accelerate CNN inference, they do not explore the flexible switching of different algorithms within the same network.…”
Section: Motivation (mentioning; confidence: 99%)
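
For contrast with im2col, the following is a minimal sketch of the kn2row idea referenced for [30]. Stride 1 and "same" zero padding are assumptions here, and the code is an illustration rather than the cited implementation: each kernel tap (k1, k2) becomes a 1x1-convolution GEMM of shape (M x C) @ (C x H*W), and the partial results are shifted and accumulated, so no (C*K*K) x (H*W) image matrix is ever materialized.

import numpy as np

def conv_kn2row(x, w):
    # kn2row-style convolution: K*K small GEMMs followed by shift-and-add.
    M, C, K, _ = w.shape
    _, H, W = x.shape
    pad = K // 2
    x_flat = x.reshape(C, H * W)             # reused for every kernel tap
    out = np.zeros((M, H, W), dtype=x.dtype)
    for k1 in range(K):
        for k2 in range(K):
            # 1x1-conv GEMM for this tap: (M x C) @ (C x H*W)
            partial = (w[:, :, k1, k2] @ x_flat).reshape(M, H, W)
            dr, dc = k1 - pad, k2 - pad      # spatial offset of this tap
            src_r = slice(max(0, dr), H + min(0, dr))
            src_c = slice(max(0, dc), W + min(0, dc))
            dst_r = slice(max(0, -dr), H + min(0, -dr))
            dst_c = slice(max(0, -dc), W + min(0, -dc))
            out[:, dst_r, dst_c] += partial[:, src_r, src_c]
    return out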
“…Then the Pad-and-Accumulate module shifts each intermediate output patch to align with the position determined by the original K1 × K2 kernel and produces the final output feature maps using an accumulation buffer. The bank indices and address offsets for Pad-and-Accumulate are precomputed based on CONV layer metadata [30]. The "unit-CONV GEMM" and "Pad-and-Accumulate" phases are pipelined, enabling the CU to start working on the next patch of the unit-CONV GEMM while the accumulation buffer is still processing the previous patch.…”
Section: Architecture Overview (mentioning; confidence: 99%)
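
A small sketch of what such a precomputation might look like; the banking scheme below is purely hypothetical, since the quoted text only states that bank indices and address offsets are derived from the CONV layer metadata. For each tap of a K1 x K2 kernel, the spatial shift translates into a fixed linear offset into a row-major accumulation buffer, which can be split once into a bank index and a word offset before the pipeline starts.

def precompute_pad_accumulate_offsets(K1, K2, W_out, num_banks):
    # For each tap (k1, k2), precompute the linear shift into a row-major
    # output buffer and a (bank, word) split; offsets are relative to the
    # output pixel currently being accumulated. Hypothetical scheme.
    pad_r, pad_c = K1 // 2, K2 // 2
    table = {}
    for k1 in range(K1):
        for k2 in range(K2):
            dr, dc = k1 - pad_r, k2 - pad_c
            lin = dr * W_out + dc                      # signed linear shift
            table[(k1, k2)] = (lin % num_banks, lin // num_banks)
    return table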