A Data-Center FPGA Acceleration Platform for Convolutional Neural Networks

Gao, Jianlin; Wang, Yuwei; Miao, Jie; Wu, Ephrem; Zhang, Heng; Meng, Yu; Zhang, Bo; Min, Biao; Chen, Dewei

doi:10.1109/fpl.2019.00032

Cited by 23 publications

(9 citation statements)

References 29 publications

“…This is partly due to the advantage of DYNAMAP's optimizations on dataflow and algorithm switching, partly due to the lower-precision we adopted enabling more PEs. Even if we scale down the systolic array size (2 DSP consumption per PE), in the worst case the performance will be halved and we still achieve 2× and 1.4× lower latency compared with [12] and [27] respectively. For Inception-v4, we compare with [31] which applies dynamic memory management to overcome data transfer bottlenecks and [25] that uses kn2row method for all layers in GoogleNet.…”

Section: Evaluation Of Optimizations (mentioning)

confidence: 99%

“…We achieve 286MHz frequency for both GoogleNet and Inception-v4 accelerator designs. GoogleNet acceleration using DYNAMAP significantly outperforms [12] and [27] in terms of both latency and throughput. This is partly due to the advantage of DYNAMAP's optimizations on dataflow and algorithm switching, partly due to the lower-precision we adopted enabling more PEs.…”

Section: Evaluation Of Optimizations (mentioning)

confidence: 99%

DYNAMAP: Dynamic Algorithm Mapping Framework for Low Latency CNN Inference

Meng, Kuppannagari, Kannan et al. 2020

Preprint

Most of the existing work on FPGA acceleration of Convolutional Neural Networks (CNNs) focuses on employing a single strategy (algorithm, dataflow, etc.) across all the layers. Such an approach does not achieve optimal latency on complex and deep CNNs. Emerging CNNs have diverse per-layer computation characteristics including parallelism, arithmetic intensity, locality, and memory footprint. Per-layer strategy selection and fine-grained tuning are required to achieve low end-to-end latency. However, specialized hardware modules dedicated to each layer limit the per-layer utilization and adversely affect end-to-end latency. In this paper, we address these problems with an algorithm-architecture co-optimization framework, DYNAMAP, consisting of (1) a unified hardware overlay that can be reused across layers, supporting dynamic mapping of all three families of popular convolution algorithms, and further allowing flexible dataflow switching to maximize hardware utilization for each layer; (2) a novel software Design Space Exploration (DSE) flow that customizes the hardware overlay and chooses the optimal strategy mapping. We show that the algorithm mapping space increases exponentially with network depth, and while the optimal algorithm selection problem is NP-hard in general, by exploiting the series-parallel structure of CNN models, we demonstrate a polynomial-time solution for optimal algorithm mapping. DYNAMAP is optimized for any CNN, including those having diverse computation and memory requirements across the layers. We demonstrate DYNAMAP using two state-of-the-art CNNs, GoogleNet and Inception-V4. The generated accelerators achieve up to 2.8× and 1.4× speedups, respectively, with respect to inference latency compared with the state-of-the-art FPGA implementations.
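The abstract above describes choosing one convolution algorithm per layer so that end-to-end latency is minimal. As a rough illustration only, the sketch below solves the simplest series-parallel case, a plain chain of layers, with dynamic programming. It is not DYNAMAP's actual algorithm: the function name, the algorithm labels, the per-layer latencies, and the uniform switching penalty are all hypothetical values chosen for the example.

```python
from typing import Dict, List, Tuple


def best_chain_mapping(
    layer_costs: List[Dict[str, float]], switch_penalty: float
) -> Tuple[float, List[str]]:
    """Pick one algorithm per layer of a linear chain to minimize total latency.

    layer_costs[i][algo] is a hypothetical latency of layer i under `algo`.
    switch_penalty is a hypothetical cost paid whenever two consecutive
    layers use different algorithms (e.g. a data-layout change).
    """
    # best[algo] = (lowest latency of any prefix ending in `algo`, its mapping)
    best = {algo: (cost, [algo]) for algo, cost in layer_costs[0].items()}
    for costs in layer_costs[1:]:
        nxt = {}
        for algo, cost in costs.items():
            def arrive(prev: str) -> float:
                # Latency so far plus the penalty for changing algorithm.
                return best[prev][0] + (switch_penalty if prev != algo else 0.0)

            prev = min(best, key=arrive)
            nxt[algo] = (arrive(prev) + cost, best[prev][1] + [algo])
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])


# Hypothetical three-layer chain with two candidate algorithms per layer.
layers = [
    {"im2col": 4, "winograd": 2},
    {"im2col": 3, "winograd": 5},
    {"im2col": 6, "winograd": 2},
]
total, mapping = best_chain_mapping(layers, switch_penalty=0.5)
print(total, mapping)
```

The work per layer is quadratic in the number of candidate algorithms, so the whole chain is solved in polynomial time even though the number of possible mappings grows exponentially with depth. Branching structures such as Inception modules would need additional handling that this sketch does not attempt.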

“…Yu et al. [27] developed an FPGA acceleration platform that leverages a unified framework architecture for general-purpose CNN inference acceleration at a data center achieving a throughput comparable with the state-of-the-art GPU in this field, with less latency. This work exploits on-chip DSPs, on a Kintex KU115, arranged as supertile units (SUs), to overcome the computational bound and, together with dispatching-assembling model and broadcast caches, to deal with the memory bound.…”

Section: Related Work (mentioning)

confidence: 99%

Optimizing Temporal Convolutional Network inference on FPGA-based accelerators

Carreras, Deriu, Raffo et al. 2020

Preprint

Convolutional Neural Networks are extensively used in a wide range of applications, commonly including computer vision tasks like image and video classification, recognition and segmentation. Recent research results demonstrate that multilayer (deep) networks involving mono-dimensional convolutions and dilation can be effectively used in time series and sequence classification and segmentation, as well as in tasks involving sequence modelling. These structures, commonly referred to as Temporal Convolutional Networks (TCNs), have been demonstrated to consistently outperform Recurrent Neural Networks in terms of accuracy and training time [1]. While FPGA-based inference accelerators for classic CNNs are widespread, the literature is lacking in a quantitative evaluation of their usability on inference for TCN models. In this paper we present such an evaluation, considering a CNN accelerator with specific features supporting TCN kernels as a reference and a set of state-of-the-art TCNs as benchmark. Experimental results show that, during TCN execution, operational intensity can be critical for the overall performance. We propose a convolution scheduling based on batch processing that can boost efficiency up to 96% of theoretical peak performance. Overall we can achieve up to 111.8 GOPS/s and a power efficiency of 33.9 GOPS/s/W on an Ultrascale+ ZU3EG (up to 10x speedup and 3x power efficiency improvement with respect to pure software implementation).
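The abstract above ties TCN performance to operational intensity and reports that batch processing improves efficiency. The sketch below is a simplified roofline-style model of why batching can help, not the scheduling proposed in that paper. It assumes the weights are fetched once per batch, the output length equals the input length, and each value occupies one byte; the function names and every numeric parameter are hypothetical.

```python
def conv1d_intensity(batch: int, length: int, c_in: int, c_out: int,
                     kernel: int, bytes_per_value: int = 1) -> float:
    """Operations per byte moved for a 1-D convolution layer (simplified).

    Assumes weights are loaded once and reused across the whole batch, and
    that the output has the same length as the input. Dilation changes which
    input samples are read but not how many operations are performed.
    """
    ops = 2 * batch * length * c_in * c_out * kernel  # multiply + accumulate
    weight_bytes = c_in * c_out * kernel * bytes_per_value
    activation_bytes = batch * length * (c_in + c_out) * bytes_per_value
    return ops / (weight_bytes + activation_bytes)


def attainable_gops(peak_gops: float, bandwidth_gbs: float,
                    intensity: float) -> float:
    """Roofline bound: limited by compute peak or by memory bandwidth."""
    return min(peak_gops, bandwidth_gbs * intensity)


# Hypothetical layer and platform figures, for illustration only.
for batch in (1, 8):
    oi = conv1d_intensity(batch, length=16, c_in=64, c_out=64, kernel=3)
    print(batch, round(oi, 2), attainable_gops(100.0, 2.0, oi))
```

Under these assumptions a larger batch amortizes the weight traffic over more computation, which raises operational intensity and moves the layer away from the memory-bound region of the roofline.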

“…In recent years, advances in integrated circuit technology have brought significant improvements to FPGA performance. Coupled with the inherently high hardware parallelism, FPGAs are used as hardware accelerators in more and more fields, such as signal processing [1,2], scientific computing [3][4][5], machine learning [6][7][8], and data centers [9][10][11]. In these applications, some algorithms include a large number of operations of vector and matrix, which belongs to the category of reduction problem.…”

Section: Introduction (mentioning)

confidence: 99%

A Novel Reduction Circuit Based on Binary Tree Path Partition on FPGAs

et al. 2021

Due to high parallelism, field-programmable gate arrays are widely used as accelerators in engineering and scientific fields, which involve a large number of operations of vector and matrix. High-performance accumulation circuits are the key to large-scale matrix operations. By selecting the adder as the reduction operator, the reduction circuit can implement the accumulation function. However, the pipelined adder will bring challenges to the design of the reduction circuit. To solve this problem, we propose a novel reduction circuit based on binary tree path partition, which can simultaneously handle multiple data sets with arbitrary lengths. It divides the input data into multiple groups and sends them to different iterations for calculation. The elements belonging to the same data set in each group are added to obtain a partial result, and the partial results of the same data set are added to achieve the final result. Compared with other reduction methods, it has the least area-time product.
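The abstract above describes splitting the input into groups, summing the elements of each data set within a group into a partial result, and then adding the partial results of the same data set. The sketch below is a purely functional software model of that two-stage accumulation. It does not model the pipelined adder or the binary tree path partition itself, and the function name, the `(set_id, value)` stream format, and the group size are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def grouped_reduce(stream: List[Tuple[Hashable, float]],
                   group_size: int) -> Dict[Hashable, float]:
    """Accumulate interleaved data sets of arbitrary length in two stages."""
    totals: Dict[Hashable, float] = defaultdict(float)
    for start in range(0, len(stream), group_size):
        group = stream[start:start + group_size]
        # Stage 1: within one group, add elements that share a data set.
        partial: Dict[Hashable, float] = defaultdict(float)
        for set_id, value in group:
            partial[set_id] += value
        # Stage 2: fold each partial result into that data set's running sum.
        for set_id, partial_sum in partial.items():
            totals[set_id] += partial_sum
    return dict(totals)


# Three hypothetical data sets of lengths 3, 2 and 1 arriving back to back.
stream = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5), ("c", 6)]
print(grouped_reduce(stream, group_size=4))
```

A data set that straddles a group boundary, such as "b" here, contributes one partial result per group, which is why the second stage is needed.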
