2021
DOI: 10.1587/elex.18.20210012

An FPGA-based accelerator for deep neural network with novel reconfigurable architecture

Abstract: Owing to its high parallelism, the dataflow architecture is a common solution for deep neural network (DNN) acceleration; however, existing DNN acceleration solutions exhibit limited flexibility across diverse network models. This paper presents a novel reconfigurable architecture as a DNN acceleration solution, consisting of circuit blocks that can all be reconfigured to adapt to different networks while maintaining high throughput. The proposed architecture shows good transferability to diverse DNN models due to its reconfigura…

Citations: cited by 9 publications (5 citation statements)
References: 33 publications (48 reference statements)
“…Identifying an efficient dataflow is crucial for defining the spatial structure of the PEs and the overall PEA. There are four common dataflows [8]: no local reuse (NLR) [4,9], input stationary (IS) [10], output stationary (OS) [11,12], and weight stationary (WS) [6,13]. The NLR dataflow, typically implemented in a tree structure, does not reuse data, leading to higher hardware costs.…”
Section: Related Work
confidence: 99%
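
The four dataflows named in this statement differ only in which operand each processing element (PE) keeps local while the other operands stream past it. Below is a minimal Python sketch, not taken from the cited papers, contrasting the output-stationary and weight-stationary loop orders on a 1-D convolution; the function names and the 1-D setup are illustrative assumptions.

```python
import numpy as np

def conv1d_output_stationary(x, w):
    """Output-stationary (OS) schedule: each output element's partial sum
    stays in a local accumulator while inputs and weights stream past it."""
    n_out = len(x) - len(w) + 1
    y = np.zeros(n_out)
    for o in range(n_out):          # one "PE" per output position
        acc = 0.0                   # partial sum held locally (stationary)
        for k in range(len(w)):     # stream inputs/weights through the PE
            acc += x[o + k] * w[k]
        y[o] = acc
    return y

def conv1d_weight_stationary(x, w):
    """Weight-stationary (WS) schedule: each weight is pinned in a PE and
    every input that needs it streams by, accumulating into the output."""
    n_out = len(x) - len(w) + 1
    y = np.zeros(n_out)
    for k in range(len(w)):         # one "PE" per weight tap (stationary)
        for o in range(n_out):      # stream inputs past the pinned weight
            y[o] += x[o + k] * w[k]
    return y
```

Both functions compute the same result; only the loop order, and hence the operand that stays resident in a PE, changes. In the OS version the accumulator `acc` plays the role of the register held inside a PE; in the WS version the pinned weight `w[k]` does.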
“…Notably, the ResNet network introduced the concept of residual blocks, enabling the training of deeper neural networks, which helps mitigate the vanishing gradient problem [14]. Work on FPGAs, which offer the advantages of low latency, low power consumption, and high flexibility over traditional hardware acceleration solutions, has been widely carried out [15–27]. However, they face limitations in on-chip resources, and modifications in network architecture necessitate hardware circuit redesign.…”
Section: Introduction
confidence: 99%
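
For readers unfamiliar with the residual blocks this statement refers to, here is a minimal numpy sketch of the idea: the identity shortcut gives gradients a path around the learned mapping, which is what eases vanishing gradients in deep networks. The fully connected form and the names below are illustrative assumptions, not ResNet's actual convolutional block.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Minimal fully connected residual block: y = ReLU(x + F(x)),
    where F(x) = W2 @ ReLU(W1 @ x). The identity shortcut `x + fx`
    lets gradients bypass F during backpropagation."""
    fx = w2 @ relu(w1 @ x)   # the learned residual mapping F(x)
    return relu(x + fx)      # identity shortcut added before activation
```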
“…However, NMS is a greedy algorithm that is computationally intensive and has a complexity of O(N²), leading to increased processing time for a large number of detected targets. Recently, many FPGA-based and ASIC edge neural network acceleration chips [7,8,9,10,11,12,13,14], such as UNPU [11], Eyeriss [12], and CASSANN-v2 [13], have been proposed to target general neural network operations (i.e., convolution). However, when deploying object detection neural networks, these chips often offload the NMS algorithm to the on-chip embedded CPU, significantly increasing the end-to-end inference time of object detection neural networks at the edge. Therefore, it is vital to develop a customized circuit to reduce the computation time of the NMS algorithm at the edge.…”
Section: Introduction
confidence: 99%
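
The O(N²) cost this statement attributes to NMS comes from the pairwise overlap tests in the greedy loop. Below is a minimal numpy sketch of greedy non-maximum suppression, an illustrative reference version rather than the customized circuit the citing authors argue for; `iou`, `nms`, and the (x1, y1, x2, y2) box layout are assumed names and conventions.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and suppress
    the remaining boxes that overlap it too much. The pairwise IoU tests
    give the O(N^2) worst case noted in the citation above."""
    order = np.argsort(scores)[::-1]   # indices by descending score
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        # compare the winner against every remaining box: O(N) per pick
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) < iou_threshold])
    return keep
```

Each iteration keeps the highest-scoring survivor and filters the rest, so with N boxes and few suppressions the loop performs on the order of N²/2 IoU evaluations; that worst case is what a dedicated hardware NMS unit must bound.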