2015
DOI: 10.1145/2872887.2750389

ShiDianNao: Shifting Vision Processing Closer to the Sensor

Abstract: In recent years, neural network accelerators have been shown to achieve both high energy efficiency and high performance for a broad application scope within the important category of recognition and mining applications. Still, both the energy efficiency and performance of such accelerators remain limited by memory accesses. In this paper, we focus on image applications, arguably the most important category among recognition and mining applications. The neural networks which are state-of-the-art for these appli…

Cited by 148 publications (19 citation statements)
References 47 publications

“…Krizhevsky, A. et al. used a local response normalization (LRN) operation in AlexNet to reduce the top-1 and top-5 error rates by 1.4% and 1.2% [24]. Du, Z. et al. added local response normalization (LRN) and local contrast normalization (LCN) in the design of ShiDianNao, which improved recognition accuracy but increased computational and hardware complexity [40]. The batch normalization proposed by Ioffe, S. et al. in 2015 is widely used in deep neural networks and effectively accelerates training and convergence [51].…”
Section: Normalization Layer
Citation type: mentioning (confidence: 99%)
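
The statement above contrasts LRN/LCN (as implemented in ShiDianNao's datapath) with batch normalization, but gives neither definition. For reference, here is a minimal NumPy sketch of across-channel LRN with AlexNet's default hyper-parameters (k=2, n=5, alpha=1e-4, beta=0.75) and of batch normalization as proposed by Ioffe and Szegedy; the function names and tensor layouts are illustrative choices, not details taken from any of the cited papers.

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Across-channel LRN with AlexNet's default hyper-parameters.

    a: activations of shape (C, H, W). Each activation is divided by a
    power of the summed squares of its n neighbouring channels.
    """
    C = a.shape[0]
    out = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C, i + n // 2 + 1)
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        out[i] = a[i] / denom
    return out

def batch_norm(x, gamma=1.0, beta_shift=0.0, eps=1e-5):
    """Batch normalization: normalize over the batch axis, then scale and shift."""
    mu, var = x.mean(axis=0), x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta_shift

# Tiny smoke test on random activations.
acts = np.random.rand(8, 4, 4).astype(np.float32)
print(local_response_norm(acts).shape)        # (8, 4, 4)
print(batch_norm(acts.reshape(8, -1)).shape)  # (8, 16)
```

The extra exponentiations and the cross-channel window in LRN are the kind of datapath cost the excerpt alludes to when it says the ShiDianNao design's normalization support increased computational and hardware complexity.
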
“…proposed ShiDianNao, based on a 2-D mesh topology, for image recognition applications near sensors, and reduced memory usage through weight sharing [40]. Zhang, C. et al. designed a CNN accelerator based on an adder tree structure through quantitative analysis of the memory bandwidth required for a given throughput [41].…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
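
Neither sentence spells out what weight sharing or an adder tree buys an accelerator, so the sketch below illustrates the two ideas together: a single K x K kernel reused at every output position (so only K*K weights ever need on-chip storage), with each window reduced by a balanced, log-depth adder tree rather than one long sequential accumulation chain. The function names and toy sizes are my own assumptions, not details from [40] or [41].

```python
import numpy as np

def adder_tree_sum(values):
    """Reduce a set of partial products with a balanced adder tree.

    Mirrors the log2(N)-depth reduction used in adder-tree style CNN
    accelerators: pairs are summed level by level.
    """
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:  # pad odd levels with a zero input
            level.append(0.0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

def conv2d_shared_weights(ifmap, kernel):
    """Valid 2-D convolution with a single shared K x K kernel.

    Weight sharing: the same kernel values are reused at every output
    position, so only K*K weights need to sit in on-chip storage.
    """
    H, W = ifmap.shape
    K = kernel.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            window = ifmap[y:y + K, x:x + K]
            products = (window * kernel).ravel()  # K*K multipliers in parallel
            out[y, x] = adder_tree_sum(products)  # reduced by the adder tree
    return out

ifmap = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0  # 3x3 averaging kernel
print(conv2d_shared_weights(ifmap, kernel))
```
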
“…DaDianNao supports convolution, pooling, classifier and LRN layers, and when using a 64-node architecture, it achieves more than 2000x acceleration in convolution computation compared to GPU baselines. ShiDianNao [8] focuses on accelerating convolution operations in embedded applications, and supports pooling, classification, and normalization layers as well. ShiDianNao uses inter-PE data propagation to reduce memory access in convolution, which makes it highly energy efficient.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
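
As a rough illustration of why inter-PE data propagation saves memory traffic, the back-of-envelope model below counts on-chip input-buffer reads for one convolutional layer under two policies: every PE fetching its own K x K window independently, versus an idealized reuse scheme in which each input pixel leaves the buffer once and is then shifted between neighbouring PEs. This is a simplified counting model under my own assumptions, not ShiDianNao's actual dataflow or published numbers.

```python
def reads_without_reuse(H, W, K):
    """Every PE (one per output pixel) fetches its full K x K input window."""
    out_h, out_w = H - K + 1, W - K + 1
    return out_h * out_w * K * K

def reads_with_inter_pe_propagation(H, W, K):
    """Idealized inter-PE reuse: each input pixel is read from the on-chip
    buffer once and then forwarded between neighbouring PEs, so all
    overlapping convolution windows see it without another buffer access."""
    return H * W

H, W, K = 32, 32, 3  # toy feature-map and kernel sizes (assumptions)
print("independent fetch :", reads_without_reuse(H, W, K), "buffer reads")
print("inter-PE forward  :", reads_with_inter_pe_propagation(H, W, K), "buffer reads")
```

For the toy 32x32 input with a 3x3 kernel this is 8100 reads versus 1024, i.e. roughly a K*K-fold reduction in input-buffer traffic in the idealized case.
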
“…Several libraries and frameworks have been developed for the implementation of DNNs via GPUs; these include Theano (a Python library) [20], Caffe (a deep learning framework) [21], and TensorFlow and Chainer (Python-based deep learning frameworks) [22, 23]. Some DNNs, such as some MLPs [24, 25], RBMs [26, 27], and CNNs [28-31], have been developed as dedicated chips. One of the reports uses RBMs for training and AEs for inference [32].…”
Section: Introduction
Citation type: mentioning (confidence: 99%)