Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on FPGAs

Wei, Xuechao; Yu, Cody Hao; Zhang, Peng; Chen, Youxiang; Wang, Yuxin; Han, Hu; Liang, Yun; Cong, Jason

doi:10.1145/3061639.3062207

Cited by 345 publications

(175 citation statements)

References 21 publications

Supporting

Mentioning

174

Contrasting


“…Our work uses instead a higher precision data type (27 bits data, 18 bits weights), considering the specific hardware multiplier implementation using 27 by 18 multipliers, which cannot (easily) be tiled with lower size operators. Considering these limitations, our work positions between [2] and [3], and is one of the fastest implementations compared to state of the art, while providing an especially good accuracy. In order to significantly improve the performance further we will employ the Winograd transformation in future research.…”

Section: B Results and Comparison (mentioning)

confidence: 99%


Convolutional Neural Networks on Dataflow Engines

Voss, Bacis, Mencer, et al. 2017

2017 IEEE International Conference on Computer Design (ICCD)


Abstract: In this paper we discuss a high performance implementation for Convolutional Neural Networks (CNNs) inference on the latest generation of Dataflow Engines (DFEs). We discuss the architectural choices made during the design phase taking into account the DFE chip properties. We then perform design space exploration, considering the memory bandwidth and resource utilisation constraints derived from the used DFE and the chosen architecture. Finally, we discuss the high performance implementation and compare the obtained performance against other implementations, showing that our proposed design reaches 2,450 GOPS when running VGG16 as a test case.


“…In [2], the authors propose an end-to-end automation flow for systolic array design synthesis. A 2D systolic array structure improves the timing and the data reuse of the design, and is obtained from the analysis of the nested loops implementing the considered algorithm.…”

Section: MaxJ and MAX5 DFE (mentioning)

confidence: 99%
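The statement above describes a 2D systolic array derived from a nested-loop analysis. As a rough illustration only, the following sketch simulates a generic output-stationary schedule for matrix multiplication, in which processing element (i, j) consumes operand pair k at cycle i + j + k. The function name `systolic_matmul` and the schedule itself are assumptions made for illustration; they are not taken from the cited paper or from [2].

```python
def systolic_matmul(a, b):
    """Cycle-by-cycle model of an m x n output-stationary systolic array.

    Hypothetical sketch: PE (i, j) holds the partial sum for c[i][j].
    Operands are skewed so that PE (i, j) sees a[i][k] and b[k][j]
    together at cycle t = i + j + k.
    """
    m, depth, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    # The last PE (m-1, n-1) receives its last operand pair at
    # cycle (m-1) + (n-1) + (depth-1), hence this cycle count.
    cycles = m + n + depth - 2
    for t in range(cycles):
        for i in range(m):
            for j in range(n):
                k = t - i - j  # operand index reaching PE (i, j) now
                if 0 <= k < depth:
                    c[i][j] += a[i][k] * b[k][j]
    return c, cycles
```

Each operand only moves between neighbouring PEs from one cycle to the next in this model, which is the locality property that makes such arrays attractive for timing and data reuse.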


“…For convolution layers, in which the processing is described in listing 6a, finding the optimal PE configuration can be seen as a loop optimization problem [39,9,28] [77,65,40,78,36,79,80,43]. This problem is addressed by applying loop optimization techniques such loop unrolling, loop tiling or loop interchange to the 7 nested loops of listing 6a.…”

Section: SIMD Accelerators and Loop Optimization (mentioning)

confidence: 99%
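The statement above refers to a 7-loop convolution nest (listing 6a of the citing survey, which is not reproduced here). A minimal sketch of what tiling and interchange do to such a nest is given below, assuming stride 1, no padding, square kernels, and integer data. The function names, the loop order, and the tile sizes `tn` and `tc` are illustrative assumptions, not the survey's listing.

```python
def _shapes(x, w):
    # x: [batch][in_ch][height][width], w: [out_ch][in_ch][k][k]
    B, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    N, K = len(w), len(w[0][0])
    return B, C, N, K, H - K + 1, W - K + 1

def conv_reference(x, w):
    """Direct convolution written as 7 nested loops."""
    B, C, N, K, R, S = _shapes(x, w)
    y = [[[[0] * S for _ in range(R)] for _ in range(N)] for _ in range(B)]
    for b in range(B):
        for n in range(N):
            for c in range(C):
                for r in range(R):
                    for s in range(S):
                        for i in range(K):
                            for j in range(K):
                                y[b][n][r][s] += x[b][c][r + i][s + j] * w[n][c][i][j]
    return y

def conv_tiled(x, w, tn=2, tc=2):
    """Same arithmetic with the channel loops tiled and moved innermost.

    The two innermost loops touch at most tn * tc independent products;
    in hardware they would typically be fully unrolled into parallel
    multipliers (an assumption of this sketch).
    """
    B, C, N, K, R, S = _shapes(x, w)
    y = [[[[0] * S for _ in range(R)] for _ in range(N)] for _ in range(B)]
    for n0 in range(0, N, tn):          # tile loop over output channels
        for c0 in range(0, C, tc):      # tile loop over input channels
            for b in range(B):
                for r in range(R):
                    for s in range(S):
                        for i in range(K):
                            for j in range(K):
                                for n in range(n0, min(n0 + tn, N)):
                                    for c in range(c0, min(c0 + tc, C)):
                                        y[b][n][r][s] += x[b][c][r + i][s + j] * w[n][c][i][j]
    return y
```

With integer inputs the two functions return identical results for any tile sizes, since tiling and interchange only reorder the additions.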

Accelerating the CNN Inference on FPGAs

Abdelouahab, Pelcat, Berry 2020

Deep Learning in Computer Vision


Convolutional Neural Networks (CNNs) are currently adopted to solve an ever greater number of problems, ranging from speech recognition to image classification and segmentation. The large amount of processing required by CNNs calls for dedicated and tailored hardware support methods. Moreover, CNN workloads have a streaming nature, well suited to reconfigurable hardware architectures such as FPGAs. The amount and diversity of research on the subject of CNN FPGA acceleration within the last 3 years demonstrates the tremendous industrial and academic interest. This paper presents a state-of-the-art of CNN inference accelerators over FPGAs. The computational workloads, their parallelism and the involved memory accesses are analyzed. At the level of neurons, optimizations of the convolutional and fully connected layers are explained and the performances of the different methods compared. At the network level, approximate computing and datapath optimization methods are covered and state-of-the-art approaches compared. The methods and tools investigated in this survey represent the recent trends in FPGA CNN inference accelerators and will fuel the future advances on efficient hardware deep learning.


“…This method is demonstrated to achieve higher hardware performance than iterative pruning due to the regularity in weight storage and computation. The second one is efficient hardware implementations, including FPGAs and ASICs [1,7,31,32,36,49,50,53,54]. FPGAs are gaining more popularity for striking a balance between high hardware performance and fast development round.…”

Section: Introduction (mentioning)

confidence: 99%

Req-Yolo

Ding, Wang, Liu, et al. 2019

Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Self Cite


Deep neural networks (DNNs), as the basis of object detection, will play a key role in the development of future autonomous systems with full autonomy. The autonomous systems have special requirements of real-time, energy-efficient implementations of DNNs on a power-constrained system. Two research thrusts are dedicated to performance and energy efficiency enhancement of the inference phase of DNNs. The first one is model compression techniques while the second is efficient hardware implementation. Recent works on extremely-low-bit CNNs such as the binary neural network (BNN) and XNOR-Net replace the traditional floating point operations with binary bit operations which significantly reduces the memory bandwidth and storage requirement. However, it suffers from non-negligible accuracy loss and underutilized digital signal processing (DSP) blocks of FPGAs. To overcome these limitations, this paper proposes REQ-YOLO, a resource aware, systematic weight quantization framework for object detection, considering both algorithm and hardware resource aspects in object detection. We adopt the block-circulant matrix method and propose a heterogeneous weight quantization using Alternating Direction Method of Multipliers (ADMM), an effective optimization technique for general, non-convex optimization problems. To achieve real-time, highly-efficient implementations on FPGA, we present the detailed hardware implementation of block circulant matrices on CONV layers and develop an efficient processing element (PE) structure supporting the heterogeneous weight quantization, CONV dataflow and pipelining techniques, design optimization, and a template-based automatic synthesis framework to optimally exploit hardware resource. Experimental results show that our proposed REQ-YOLO framework can significantly compress the YOLO model while introducing very small accuracy degradation.
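The abstract above mentions the block-circulant matrix method without detail. As a hedged illustration of why circulant structure compresses weights, the sketch below multiplies a single circulant block by a vector using only the block's first column, so an n x n block is stored as n values. The function names are hypothetical, the direct O(n^2) summation is chosen for clarity, and nothing here reflects the REQ-YOLO hardware design.

```python
def circulant_matvec(first_col, x):
    """y = C x, where C[i][j] = first_col[(i - j) % n] is circulant.

    Only first_col (n values) is stored instead of the n * n dense block.
    """
    n = len(first_col)
    return [sum(first_col[(i - j) % n] * x[j] for j in range(n))
            for i in range(n)]

def dense_circulant(first_col):
    """Expand the same block to a dense matrix, for checking only."""
    n = len(first_col)
    return [[first_col[(i - j) % n] for j in range(n)] for i in range(n)]
```

A full weight matrix in this scheme would be partitioned into such blocks, each multiplied independently; FFT-based evaluation is a common way to speed up the per-block product, though it is not shown here.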

