2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) 2016
DOI: 10.1109/isca.2016.42
Cambricon: An Instruction Set Architecture for Neural Networks

Abstract: Neural Networks (NN) are a family of models for a broad range of emerging machine learning and pattern recognition applications. NN techniques are conventionally executed on general-purpose processors (such as CPUs and GPGPUs), which are usually not energy-efficient, since they invest excessive hardware resources to flexibly support various workloads. Consequently, application-specific hardware accelerators for neural networks have recently been proposed to improve energy efficiency. However, such accelerator…

Cited by 146 publications (117 citation statements)
References 40 publications
“…Instruction set. Previous SIMD works usually devised load instructions for parameters and adopted medium-grained operands for features, such as vector/matrix [37], 2D tile [49], and compute tile [47], to provide flexibility. In contrast, we apply a parameter-inside approach and large-grained feature operands to optimize power consumption and computing capability for highly parallel convolution.…”
Section: Related Work
confidence: 99%
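The contrast above is between coarse-grained (vector/matrix) operands, as in Cambricon, and scalar instruction streams. A minimal Python sketch of how an MLP layer might map onto a few Cambricon-style coarse-grained instructions (the mnemonics in the comments are assumed for illustration, not verified against the published ISA encoding):

```python
import numpy as np

def mlp_layer(W, x, b):
    # MLOAD  W_reg, W_addr        ; load weight matrix into on-chip scratchpad
    # VLOAD  x_reg, x_addr        ; load input vector
    # MMV    y_reg, W_reg, x_reg  ; one matrix-multiply-vector instruction
    y = W @ x
    # VLOAD  b_reg, b_addr
    # VAV    y_reg, y_reg, b_reg  ; vector-add-vector for the bias
    y = y + b
    # Elementwise sigmoid, expressible with vector exp/divide instructions
    return 1.0 / (1.0 + np.exp(-y))

W = np.ones((4, 3))
x = np.ones(3)
b = np.zeros(4)
out = mlp_layer(W, x, b)  # a handful of coarse-grained instructions
                          # instead of O(rows * cols) scalar MACs
```

The point of the matrix-grained operand is instruction-fetch efficiency: one MMV stands in for the whole multiply-accumulate loop nest that a scalar ISA would issue instruction by instruction.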
“…Eyeriss [6] and ShiDianNao [9] improve the NFU dataflow to maximize operand reuse. A number of other digital designs [16], [20], [10] have also emerged in the past year. Analog Accelerators.…”
Section: B. The Landscape of CNN Accelerators
confidence: 99%
“…In addition, FT2000 and CEVA-XM6 are vector processors that include both a vector processing unit and a scalar processing unit; the main difference is that CEVA-XM6 is designed to accelerate only matrix convolution, whereas FT2000 is optimized through algorithmic improvements. The similarity between FT2000 and Cambricon [22] is that both are programmable via an instruction set, which makes it quick to realize different kinds of neural networks; the difference is that Cambricon was only simulated and never taped out. FT2000 and TPU are similar in their architecture, except that FT2000 is a general-purpose neural network accelerator, while TPU only supports CNN, LSTM, and MLP. (Table 1 compares the parameters of FT2000 and current mainstream neural network accelerators.)…”
Section: Comparison of FT2000 with Other Processor Architectures
confidence: 99%
“…The computing time of convolutional layers accounts for about 85% of the total model [22], so accelerating convolution calculation in CNNs has become a hotspot in current neural network acceleration. Convolution is mainly carried out between a large input feature map and a small convolutional kernel: kernels are small, such as 1 × 1, 3 × 3, or 5 × 5, whereas the input feature map is of a larger scale, such as 224 × 224 × 3 in GoogLeNet.…”
Section: Data Layout Analysis
confidence: 99%
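Those shape numbers translate directly into a compute estimate. A short Python sketch of why a small kernel sliding over a 224 × 224 × 3 input dominates a CNN's arithmetic (the 3 × 3 / 64-channel layer below is an illustrative choice, not a specific GoogLeNet layer):

```python
def conv2d_macs(h, w, cin, k, cout, stride=1):
    """Multiply-accumulates for one conv layer with no padding:
    every output pixel costs one k*k*cin dot product, repeated
    for each of the cout output channels."""
    oh = (h - k) // stride + 1  # output height
    ow = (w - k) // stride + 1  # output width
    return oh * ow * cout * k * k * cin

# 224 x 224 x 3 input (as in GoogLeNet), 3 x 3 kernel, 64 output channels
macs = conv2d_macs(224, 224, 3, 3, 64)  # ~85 million multiply-accumulates
```

Even with only 27 weights per output channel (3 × 3 × 3), the kernel is re-applied at every one of the 222 × 222 output positions, which is why data layout for the large feature map, rather than the small kernel, is the central concern of the cited analysis.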