In autofocus systems, rapid and precise determination of the focusing distance is crucial for capturing sharp images, a challenge that demands efficient and effective algorithms. To enhance this process, we introduce a novel autofocus mechanism based on a deep learning approach with an optimized VGG network architecture, designated ST-VGG. This method adapts the VGG16 framework by refining its convolutional layers and optimizing parameters to improve efficiency, notably incorporating global average pooling to reduce the risk of overfitting. Moreover, we integrate the Inception architecture to enable multi-scale feature extraction without additional computational burden, enhancing the model's analytical depth and efficiency. By adopting depthwise separable convolutions, we further reduce model complexity and computational demand, making ST-VGG well suited for mobile and embedded applications. We frame autofocus as a regression problem and, through extensive validation on a comprehensive dataset, demonstrate that our model outperforms existing compact networks (TVGG, TSwinT, TViT) and established architectures (LeNet, AlexNet) in prediction accuracy and consistency, with minimal bias. The ST-VGG model markedly reduces the number of trainable parameters, decreases validation loss by 0.02, and achieves an inference speed of 1.4 ms per image. These advancements not only improve the model's efficacy but also reduce the GPU memory required for training, offering significant potential benefits for applications in the life sciences and microrobotics.
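
To make the described building blocks concrete, the following minimal PyTorch-style sketch shows how depthwise separable convolutions, an Inception-style multi-branch block, global average pooling, and a scalar regression head could be combined for focus-distance prediction. The class names, layer counts, and channel widths (e.g. STVGGSketch, 32/64 channels) are illustrative assumptions and do not represent the authors' actual ST-VGG configuration.

# Illustrative sketch only: the exact ST-VGG layer counts, channel widths, and
# branch structure are not given in the abstract; all values below are assumptions
# used to show how the described components fit together.
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv followed by a pointwise 1x1 conv (fewer parameters than a full conv)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))


class InceptionBlock(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 branches concatenated to capture features at multiple scales."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1),
                                nn.Conv2d(branch_ch, branch_ch, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1),
                                nn.Conv2d(branch_ch, branch_ch, 5, padding=2))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)


class STVGGSketch(nn.Module):
    """VGG-style stack of separable convs, one Inception block, global average pooling, and a scalar regression head."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            DepthwiseSeparableConv(3, 32), nn.MaxPool2d(2),
            DepthwiseSeparableConv(32, 64), nn.MaxPool2d(2),
            InceptionBlock(64, 32),       # three 32-channel branches -> 96 channels
            nn.AdaptiveAvgPool2d(1),      # global average pooling instead of flatten + large FC layers
        )
        self.regressor = nn.Linear(96, 1)  # single output: predicted focusing distance

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.regressor(x)


if __name__ == "__main__":
    model = STVGGSketch()
    focus = model(torch.randn(1, 3, 224, 224))  # predicted focus distance, shape (1, 1)
    print(focus.shape)

Because autofocus is framed as a regression problem, such a model would typically be trained with a regression objective such as mean squared error between the predicted and true focusing distances.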