<i>Uni-OPU</i>: An FPGA-Based Uniform Accelerator for Convolutional and Transposed Convolutional Networks

Yu, Yunxuan; Zhao, Tiandong; Wang, Mingyu; Wang, Kun; He, Lei

doi:10.1109/tvlsi.2020.2995741

Cited by 43 publications

(17 citation statements)

References 21 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The obtained results are summarized in Table 1 in terms of: supported parallelism (T M , T N , P M , and P N ), kernel size (K) and stride (S); resources requirements; running frequency; number of operations performed per second (GOPs); and, finally, dynamic power consumption. It is worth highlighting that while the designs presented in [15][16][17] are SUs, those demonstrated in [9,11,18] are embedded heterogeneous systems (ESs). For this reason, several SU and ES versions of the design here presented have been characterized and they are referenced in Table 1.…”

Section: Implementation and Resultsmentioning

confidence: 99%

“…The efficient design strategy recently presented in [ 17 ] overcomes the above issues by performing a kernel conversion to calculate all the pre-addable weight combinations. The output of this process is a new set of filters that can be directly applied to the ifmaps to perform a traditional 3D convolution.…”

Section: Background Related Work and Motivationsmentioning

confidence: 99%

“…As a further advantage, FPGA-based designs also ensure flexibility and low costs to be achieved. However, although, as exhaustively reviewed in [ 7 ], plenty of FPGA-based accelerators can be found in the literature for CNNs, existing works focusing on the design of FPGA-based engines suitable for hardware-accelerating DCNNs are still few [ 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 ] and this problem is still open.…”

Section: Introductionmentioning

confidence: 99%

“…However, for purposes of comparison with state-of-art competitors, the Xilinx Zynq XC7Z020 (Xilinx, San Jose, CA, USA), XC7Z045 (Xilinx), XC7Z100 (Xilinx) [ 23 ] and the Virtex-7 XC7VX690T (Xilinx) [ 24 ] devices have been used as the implementation platforms to characterize the proposed accelerator when running either as a Standalone Unit (SU) or as a part of an Embedded System (ES). When compared to the designs presented in [ 9 , 11 , 15 , 16 , 17 , 18 ], both the SU and the ES implementations proposed here exhibit remarkably higher throughput and they employ significantly lower amounts of Look-Up Tables (LUTs), Flip-Flops (FFs), on-chip Blocks of Random Access Memory (BRAMs), and Digital Signal Processors (DSPs). As an example, when implemented within the XC7Z020 device, the proposed ES is 50% faster than [ 9 ], performs ~20.7× more GOPs, and it occupies 49.8%, 42.7%, 36.6% and 5% less LUTs, FFs, BRAMs, and DSPs, respectively, dissipating just 1.73 W@150 MHz.…”

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Efficient Deconvolution Architecture for Heterogeneous Systems-on-Chip

et al. 2020

View full text Add to dashboard Cite

Today, convolutional and deconvolutional neural network models are exceptionally popular thanks to the impressive accuracies they have been proven in several computer-vision applications. To speed up the overall tasks of these neural networks, purpose-designed accelerators are highly desirable. Unfortunately, the high computational complexity and the huge memory demand make the design of efficient hardware architectures, as well as their deployment in resource- and power-constrained embedded systems, still quite challenging. This paper presents a novel purpose-designed hardware accelerator to perform 2D deconvolutions. The proposed structure applies a hardware-oriented computational approach that overcomes the issues of traditional deconvolution methods, and it is suitable for being implemented within any virtually system-on-chip based on field-programmable gate array devices. In fact, the novel accelerator is simply scalable to comply with resources available within both high- and low-end devices by adequately scaling the adopted parallelism. As an example, when exploited to accelerate the Deep Convolutional Generative Adversarial Network model, the novel accelerator, running as a standalone unit implemented within the Xilinx Zynq XC7Z020 System-on-Chip (SoC) device, performs up to 72 GOPs. Moreover, it dissipates less than 500mW@200MHz and occupies 5.6%, 4.1%, 17%, and 96%, respectively, of the look-up tables, flip-flops, random access memory, and digital signal processors available on-chip. When accommodated within the same device, the whole embedded system equipped with the novel accelerator performs up to 54 GOPs and dissipates less than 1.8W@150MHz. Thanks to the increased parallelism exploitable, more than 900 GOPs can be executed when the high-end Virtex-7 XC7VX690T device is used as the implementation platform. Moreover, in comparison with state-of-the-art competitors implemented within the Zynq XC7Z045 device, the system proposed here reaches a computational capability up to 20% higher, and saves more than 60% and 80% of power consumption and logic resources requirement, respectively, using 5.7× fewer on-chip memory resources.

show abstract

Section: Implementation and Resultsmentioning

confidence: 99%

Section: Background Related Work and Motivationsmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Efficient Deconvolution Architecture for Heterogeneous Systems-on-Chip

et al. 2020

View full text Add to dashboard Cite

show abstract

“…DNN acceleration solutions mainly faces two bottlenecks: enormous multiply and accumulate (MAC) operations and great number of parameters. To deal with these problems, researchers have been focused on application and specific integrated circuits (ASIC) [7,8,9,10,11,12,13,14,15] and field-programmable gate array (FPGA) [16,17,18,19,20,21,22,23,24]. Due to its high parallelism property, data flow architectures has become a key research area [8,9,10,11,12,13,18,19,20].…”

Section: Introductionmentioning

confidence: 99%

An FPGA-based accelerator for deep neural network with novel reconfigurable architecture

Han

Ren

Zou

2021

IEICE Electron. Express

View full text Add to dashboard Cite

Due to the high parallelism, Data flow architecture is a common solution for deep neural network (DNN) acceleration, however, existing DNN accelerate solutions exhibit limited flexibility to diverse network models. This paper presents a novel reconfigurable architecture as DNN accelerate solution, which consists of circuit blocks all can be reconfigured to adapt to different networks, and maintain high throughput. The proposed architecture shows good transferability to diverse DNN models due to its reconfigurable processing element (PE) array, which can be adjusted to deal with various filter sizes of networks. In the meanwhile, according to proposed data reuse technique based on parameter proportion property of different layers in DNN, a reconfigurable on-chip buffer mechanism is raised. Moreover, the accelerator enhances its performance by exploiting the sparsity property of input feature map. Compared to other state-of-the-art solutions based on FPGA, our architecture achieves high performance, and presents good flexibility in the meantime.

show abstract