A High-Performance CNN Processor Based on FPGA for MobileNets

Wu, Di; Zhang, Yu; Jia, Xijie; Lan, Tian; Li, Tianping; Sui, Lingzhi; Xie, Dong; Shan, Yi

doi:10.1109/fpl.2019.00030

Cited by 108 publications

(34 citation statements)

References 9 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Work [47] improved the computational parallelism by separating the convolution computation and other data processing such as pooling and full connection. Bai et al [24] and Wu et al [22] proposed specific CNN accelerators for implementing depth-wise separable convolution onto FPGA. However, these works did not use fast algorithms to reduce the computational cost of convolution operations.…”

Section: Related Workmentioning

confidence: 99%

“…In practical applications, fieldprogrammable gate array (FPGA) is a popular option for designing AUVs due the conveniences of expanding peripheral interfaces and customizing special hardware control logic. Compared with focusing on balancing the computation parallelism and the memory bandwidth [16][17][18][19][20][21][22][23][24], the research focusing on optimizing the implementation of convolution computation onto FPGA have attracted more attention recently. Converted convolution into general matrix multiplication (GEMM) could reduce the times of memory access [25].…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

A Real-Time FPGA Accelerator Based on Winograd Algorithm for Underwater Object Detection

Cai

Wang

2021

Electronics

View full text Add to dashboard Cite

Real-time object detection is a challenging but crucial task for autonomous underwater vehicles because of the complex underwater imaging environment. Resulted by suspended particles scattering and wavelength-dependent light attenuation, underwater images are always hazy and color-distorted. To overcome the difficulties caused by these problems to underwater object detection, an end-to-end CNN network combined U-Net and MobileNetV3-SSDLite is proposed. Furthermore, the FPGA implementation of various convolution in the proposed network is optimized based on the Winograd algorithm. An efficient upsampling engine is presented, and the FPGA implementation of squeeze-and-excitation module in MobileNetV3 is optimized. The accelerator is implemented on a Zynq XC7Z045 device running at 150 MHz and achieves 23.68 frames per second (fps) and 33.14 fps when using MobileNetV3-Large and MobileNetV3-Small as the feature extractor. Compared to CPU, our accelerator achieves 7.5×–8.7× speedup and 52×–60× energy efficiency.

show abstract

Section: Related Workmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

A Real-Time FPGA Accelerator Based on Winograd Algorithm for Underwater Object Detection

Cai

Wang

2021

Electronics

View full text Add to dashboard Cite

show abstract

“…Meanwhile, many excellent algorithms have designed to accelerate the basic functions of CNN inference. In [23], two dedicated computing engines named Conv Engine and Dwcv Engine were designed for pointwise convolution and depthwise convolution to improve the efficiency. In [15], the authors aimed to accelerate sparse CNNs.…”

Section: Related Workmentioning

confidence: 99%

An Efficient Task Assignment Framework to Accelerate DPU-Based Convolutional Neural Network Inference on FPGAs

et al. 2020

View full text Add to dashboard Cite

Field Programmable Gate Array (FPGA) has become an efficient accelerator for convolutional neural network (CNN) inference due to its high performance and flexibility. To further improve the performance of CNN inference on FPGAs, an Intellectual Property core (IP core) called Deep Learning Processor Unit (DPU) is released by Xilinx. Unlike previous FPGA-based hardware designs focusing on specific functions or CNNs, the DPU IP supports ample basic functions of deep learning, and the developers can take advantage of DPUs to accelerate CNN inference conveniently. In DPU-based CNN acceleration platform, an encapsulated scheduler plays a crucial role in task scheduling between heterogeneous ARM and multiple DPUs. However, the current scheduler is unsatisfactory because its low schedule efficiency. This paper thus presents a high performance task assignment framework built upon Xilinx hybrid CPU-FPGA MPSoC devices. We first evaluate the main causes of low schedule efficiency problem. Then, we explore the scheduler rules and improve shedule efficiency through purposeful observations and analysis. Finally, we integrate our optimizations, and propose an efficient task assignment framework to maximize performance on DPU-based CNN acceleration platform. Experimental results on Xilinx Zynq UltraScale+ MPSoC zcu104 show that our efficient task assignment framework significantly boosts schedule efficiency for small-scale CNNs (from 36% to 70%), medium-scale CNNs (from 65% to 95%), and large-scale CNNs (from 77% to 99%) compared with original schedule strategy. INDEX TERMS Field programmable gate array (FPGA), deep learning processor unit (DPU), convolutional neural network (CNN) accelerator, schedule efficiency.

show abstract

“…They use a smaller FPGA, but we expect that if they scaled up to a larger FPGA, their DSP-to-logic utilization ratio would remain roughly the same and their accelerator would still be unable to take advantage of all of the available multipliers. Table IV shows a comparison of HPIPE to the V100 GPU running MobileNet-V1 and a comparison of HPIPE to the FPGA accelerator from Wu et al [27] running MobileNet-V2. NVIDIA does not report accuracy for their implementation of MobileNet-V1.…”

Section: B Sparse Cnn On Fpgamentioning

confidence: 99%

HPIPE: Heterogeneous Layer-Pipelined and Sparse-Aware CNN Inference for FPGAs

Hall

Betz

2020

Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

View full text Add to dashboard Cite

We present both a novel Convolutional Neural Network (CNN) accelerator architecture and a network compiler for FPGAs that outperforms all prior work. Instead of having generic processing elements that together process one layer at a time, our network compiler statically partitions available device resources and builds custom-tailored hardware for each layer of a CNN. By building hardware for each layer we can pack our controllers into fewer lookup tables and use dedicated routing. These efficiencies enable our accelerator to utilize 2x the DSPs and operate at more than 2x the frequency of prior work on sparse CNN acceleration on FPGAs. We evaluate the performance of our architecture on both sparse Resnet-50 and dense MobileNet Imagenet classifiers on a Stratix 10 2800 FPGA. We find that the sparse Resnet-50 model has throughput at a batch size of 1 of 4550 images/s, which is nearly 4x the throughput of NVIDIA's fastest machine learning targeted GPU, the V100, and outperforms all prior work on FPGAs.

show abstract

A High-Performance CNN Processor Based on FPGA for MobileNets

Cited by 108 publications

References 9 publications

A Real-Time FPGA Accelerator Based on Winograd Algorithm for Underwater Object Detection

A Real-Time FPGA Accelerator Based on Winograd Algorithm for Underwater Object Detection

An Efficient Task Assignment Framework to Accelerate DPU-Based Convolutional Neural Network Inference on FPGAs

HPIPE: Heterogeneous Layer-Pipelined and Sparse-Aware CNN Inference for FPGAs

Contact Info

Product

Resources

About