Accelerating Neural Network Inference on FPGA-Based Platforms—A Survey

Wu, Ran; Guo, Xuguang; Du, Jian; Li, Junbao

doi:10.3390/electronics10091025

Cited by 58 publications

(25 citation statements)

References 107 publications

“…By storing and efficiently reusing data within the memory hierarchy, the number of times accesses are made to costlier memories can be greatly reduced. Additionally, authors in [1] and [105] also provide different data-flow schemes that exploit the aforementioned data reuse opportunities,…”

Section: B Digital Accelerators (mentioning)

confidence: 99%

A Survey on the Optimization of Neural Network Accelerators for Micro-AI On-Device Inference

Mazumder, Meng, Rashid, et al. 2021. IEEE J. Emerg. Sel. Topics Circuits Syst.

Deep neural networks (DNNs) are being prototyped for a variety of artificial intelligence (AI) tasks including computer vision, data analytics, robotics, etc. The efficacy of DNNs coincides with the fact that they can provide state-of-the-art inference accuracy for these applications. However, this advantage comes from the high computational complexity of the DNNs in use. Hence, it is becoming increasingly important to scale these DNNs so that they can fit on resource-constrained hardware and edge devices. The main goal is to allow efficient processing of the DNNs on low-power micro-AI platforms without compromising hardware resources and accuracy. In this work, we aim to provide a comprehensive survey about the recent developments in the domain of energy-efficient deployment of DNNs on micro-AI platforms. To this extent, we look at different neural architecture search strategies as part of micro-AI model design, provide extensive details about model compression and quantization strategies in practice, and finally elaborate on the current hardware approaches towards efficient deployment of the micro-AI models on hardware. The main takeaways for a reader from this article will be understanding of different search spaces to pinpoint the best micro-AI model configuration, ability to interpret different quantization and sparsification techniques, and the realization of the micro-AI models on resource-constrained hardware and different design considerations associated with it.

“…Hardware resources required for each can be estimated from the generated generic PEs and layer parameters (Eqs. [11][12][13][14]), the estimations allow the tool to generate multiple designs based on resource limitations and user input. Block Ram (BRAM) tiles are currently assumed to be RAMB36E1 which can be used as RAMB18E1 and FIFO18E1 [34] when needed.…”

Section: Proposed Convolutional Layer Design (mentioning)

confidence: 99%

“…This is crucial given the rapid changes in CNN architectures. Until recently, most works only investigated the hardware implementation of forward pass CNNs as inference engines and accelerators, there is plenty of research done to map the CNN forward pass unto FPGAs for embedded inference [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18], in contrast, there is a clear lack of work in the areas of online deployment and training on FPGAs. But with the recent breakthroughs in the new field of Continuous Learning [19][20][21][22][23], online training on embedded platforms has attracted more research.…”

Section: Introduction (mentioning)

confidence: 99%

Automated CNN back-propagation pipeline generation for FPGA online training

Mazouz, Bridges. 2021. J Real-Time Image Proc

Training of convolutional neural networks (CNNs) on embedded platforms to support on-device learning has become essential for the future deployment of CNNs on autonomous systems. In this work, we present an automated CNN training pipeline compilation tool for Xilinx FPGAs. We automatically generate multiple hardware designs from high-level CNN descriptions using a multi-objective optimization algorithm that explores the design space by exploiting CNN parallelism. These designs that trade-off resources for throughput allow users to tailor implementations to their hardware and applications. The training pipeline is generated based on the backpropagation (BP) equations of convolution which highlight an overlap in computation. We translate the overlap into hardware by reusing most of the forward pass (FP) pipeline reducing the resources overhead. The implementation uses a streaming interface that lends itself well to data streams and live feeds instead of static data reads from memory. Meaning, we do not use the standard array of processing elements (PEs) approach, which is efficient for offline inference, instead we translate the architecture into a pipeline where data is streamed through allowing for new samples to be read as they become available. We validate the results using the Zynq-7100 on three datasets and varying size architectures against CPU and GPU implementations. GPUs consistently outperform FPGAs in training times in batch processing scenarios, but in data stream scenarios, FPGA designs achieve a significant speedup compared to GPU and CPU when enough resources are dedicated to the learning task. A 2.8×, 5.8×, and 3× speed up over GPU was achieved on three architectures trained on MNIST, SVHN, and CIFAR-10 respectively.

“…As an example, in the case of image classification, moving from the eight-layered AlexNet [4] to the 152-layered ResNet [5] the error rates have been reduced by more than 10%, but the amount of performed multiply-and-accumulate (MAC) operations has increased by more than 80%. Such a trend makes evident that ad-hoc designed hardware accelerators are essential for deploying CNN algorithms in real-time and power-constrained systems [6].…”

Section: Introduction (mentioning)

confidence: 99%

Design of Flexible Hardware Accelerators for Image Convolutions and Transposed Convolutions

2021

Nowadays, computer vision relies heavily on convolutional neural networks (CNNs) to perform complex and accurate tasks. Among them, super-resolution CNNs represent a meaningful example, due to the presence of both convolutional (CONV) and transposed convolutional (TCONV) layers. While the former exploit multiply-and-accumulate (MAC) operations to extract features of interest from incoming feature maps (fmaps), the latter perform MACs to tune the spatial resolution of the received fmaps properly. The ever-growing real-time and low-power requirements of modern computer vision applications represent a stimulus for the research community to investigate the deployment of CNNs on well-suited hardware platforms, such as field programmable gate arrays (FPGAs). FPGAs are widely recognized as valid candidates for trading off computational speed and power consumption, thanks to their flexibility and their capability to also deal with computationally intensive models. In order to reduce the number of operations to be performed, this paper presents a novel hardware-oriented algorithm able to efficiently accelerate both CONVs and TCONVs. The proposed strategy was validated by employing it within a reconfigurable hardware accelerator purposely designed to adapt itself to different operating modes set at run-time. When characterized using the Xilinx XC7K410T FPGA device, the proposed accelerator achieved a throughput of up to 2022.2 GOPS and, in comparison to state-of-the-art competitors, it reached an energy efficiency up to 2.3 times higher, without compromising the overall accuracy.
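
The abstracts above repeatedly refer to multiply-and-accumulate (MAC) counts as the cost that accelerators try to reduce. As a rough illustration only, the following sketch counts the MACs of a single standard convolutional layer using the usual output-size formula. The function name `conv_macs` and the example layer shape are assumptions made for this illustration; they are not taken from any of the cited papers, and the sketch does not reproduce the hardware-oriented algorithm the abstract describes.

```python
def conv_macs(h_in, w_in, c_in, c_out, k, stride=1, padding=0):
    """Count MACs for one standard (dense) 2D convolution layer.

    Hypothetical helper for illustration: assumes a square k x k kernel,
    symmetric zero padding, no dilation, and no grouping.
    """
    # Spatial size of the output feature map.
    h_out = (h_in + 2 * padding - k) // stride + 1
    w_out = (w_in + 2 * padding - k) // stride + 1
    # Each output element needs c_in * k * k multiply-accumulates,
    # and there are h_out * w_out * c_out output elements.
    return h_out * w_out * c_out * c_in * k * k


if __name__ == "__main__":
    # Example shape chosen arbitrarily: 32x32 RGB input, 16 filters of 3x3.
    print(conv_macs(32, 32, c_in=3, c_out=16, k=3, stride=1, padding=1))
```

Summing this count over all layers gives a first-order estimate of a network's compute load; it ignores memory traffic, which the data-reuse discussion quoted earlier treats as a separate cost.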
