14.6 A 1.42TOPS/W deep convolutional neural network recognition processor for intelligent IoE systems

Sim, Jaehyeong; Park, Jun‐Seok; Kim, Minhye; Bae, Dong Myung; Choi, Yoo-Joo; Kim, Lee-Sup

doi:10.1109/isscc.2016.7418008

Cited by 119 publications

(50 citation statements)

References 6 publications


“…One big differentiating point between these platforms are their assumptions in terms of algorithmic and arithmetic accuracy. Jaehyeong et al [29] rely on 24bit fixed-point arithmetics, but they approximate weights using a low-dimensional representation based on PCA. Most other works use either 16 bits [30], [31] or 12 bits [26].…”

Section: B Low-Power CNN Hardware IPs (mentioning)

confidence: 99%


An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics

Conti

Schilling

Schiavone

et al. 2017

IEEE Trans. Circuits Syst. I

112


Near-sensor data analytics is a promising direction for IoT endpoints, as it minimizes energy spent on communication and reduces network load, but it also poses security concerns, as valuable data is stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address data security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a System-on-Chip based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65 nm technology, consumes less than 20 mW on average at 0.8 V, achieving an efficiency of up to 70 pJ/B in encryption, 50 pJ/px in convolution, or up to 25 MIPS/mW in software. As a strong argument for real-life flexible application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep CNN consuming 3.16 pJ per equivalent RISC op; local CNN-based face detection with secured remote recognition in 5.74 pJ/op; and seizure detection with encrypted data collection from EEG within 12.7 pJ/op.


“…e Weights produced on-chip from a small set of PCA bases to save area/power. No evaluation on the general validity of this approach is presented in [29]. f Performance & power of inference engines only, estimating they are responsible for 20% of total power.…”

Section: A System-on-Chip Operating Modes (mentioning)

confidence: 99%

An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics

Conti

Schilling

Schiavone

et al. 2017

IEEE Trans. Circuits Syst. I

112
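The citation statements above describe convolution weights being generated on-chip from a small set of PCA bases rather than stored in full. A minimal sketch of that idea follows, assuming each kernel is flattened and rebuilt as a mean vector plus a weighted sum of basis vectors. The function name `reconstruct_kernel`, the use of a mean vector, and the toy values are illustrative assumptions, not details taken from the cited processor.

```python
def reconstruct_kernel(mean, bases, coeffs):
    """Rebuild a flattened kernel as mean + sum_k coeffs[k] * bases[k].

    mean   : list of floats, one per kernel element (assumed, hypothetical)
    bases  : list of basis vectors, each the same length as mean
    coeffs : one coefficient per basis vector for this particular kernel
    """
    kernel = list(mean)
    for coeff, basis in zip(coeffs, bases):
        for i, b in enumerate(basis):
            kernel[i] += coeff * b
    return kernel


# Toy example: a 4-element kernel described by 2 coefficients instead of 4 weights.
mean = [0.5, 0.5, 0.5, 0.5]
bases = [[1.0, 0.0, -1.0, 0.0],
         [0.0, 1.0, 0.0, -1.0]]
kernel = reconstruct_kernel(mean, bases, [2.0, 0.25])
print(kernel)  # [2.5, 0.75, -1.5, 0.25]
```

With shared bases, each additional kernel costs only its coefficients, which is the source of the reduction in data read from external memory that the citing papers mention.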


“…KU Leuven's accelerator [16] is a SIMD array system with dynamic voltage and bit precision control, aiming for low-power mobile applications. The accelerator proposed by KAIST [17] is a CNN accelerator which employs principal component analysis for the weights of convolutional layers to minimize the data size read from external memory. ShiDianNao [18] and its previous work DaDianNao [19] also retain the weight values in the internal buffers and employ spatial-mapped neural function unit.…”

Section: Related Work (mentioning)

confidence: 99%

A Multithreaded CGRA for Convolutional Neural Network Processing

Ando

Takamaeda-Yamazaki

Ikebe

et al. 2017


The convolutional neural network (CNN) is an essential model for achieving high accuracy in various machine learning applications, such as image recognition and natural language processing. One of the important issues for CNN acceleration with high energy efficiency and processing performance is efficient data reuse by exploiting the inherent data locality. In this paper, we propose a novel CGRA (Coarse Grained Reconfigurable Array) architecture with time-domain multithreading for exploiting input data locality. The multithreading on each processing element enables input data reuse across multiple computation periods. This paper presents the accelerator design and a performance analysis of the proposed architecture. We examine the structure of memory subsystems, as well as the architecture of the computing array, to supply required data with minimal performance overhead. We explore efficient architecture design alternatives based on the characteristics of modern CNN configurations. The evaluation results show that the available bandwidth of the external memory can be utilized efficiently when the output plane is wider (in earlier layers of many CNNs), while the input data locality can be utilized maximally when the number of output channels is larger (in later layers).


“…In academia, three representative works at the architectural level are Eyeriss [23], EIE [24], and the DianNao family [25][26][27], which focus specifically on the convolutional layers, the fully-connected layers, and the memory design/organization, respectively. There are a number of recent tapeouts of hardware deep learning systems [23,[28][29][30][31][32][33].…”

Section: Introduction (mentioning)

confidence: 99%

CirCNN

Ding

Liao

Wang

et al. 2017

Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture

179


Large-scale deep neural networks (DNNs) are both compute and memory intensive. As the size of DNNs continues to grow, it is critical to improve the energy efficiency and performance while maintaining accuracy. For DNNs, the model size is an important factor affecting performance, scalability and energy efficiency. Weight pruning achieves good compression ratios but suffers from three drawbacks: 1) the irregular network structure after pruning, which affects performance and throughput; 2) the increased training complexity; and 3) the lack of rigorous guarantee of compression ratio and inference accuracy. To overcome these limitations, this paper proposes CirCNN, a principled approach to represent weights and process neural networks using block-circulant matrices. CirCNN utilizes the Fast Fourier Transform (FFT)-based fast multiplication, simultaneously reducing the computational complexity (both in inference and training) from O(n²) to O(n log n) and the storage complexity from O(n²) to O(n), with negligible accuracy loss. Compared to other approaches, CirCNN is distinct due to its mathematical rigor: the DNNs based on CirCNN can converge to the same "effectiveness" as DNNs without compression. We propose the CirCNN architecture, a universal DNN inference engine that can be implemented in various hardware/software platforms with configurable network architecture (e.g., layer type, size, scales, etc.). In the CirCNN architecture: 1) Due to the recursive property, FFT can be used as the key computing kernel, which ensures universal and small-footprint implementations. 2) The compressed but regular network structure avoids the pitfalls of network pruning and facilitates high performance and throughput with a highly pipelined and parallel design. To demonstrate the performance and energy efficiency, we test CirCNN in FPGA, ASIC and embedded processors. Our results show that the CirCNN architecture achieves very high energy efficiency and performance with a small hardware footprint. Based on the FPGA implementation and ASIC synthesis results, CirCNN achieves 6-102X energy efficiency improvements compared with the best state-of-the-art results.
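The complexity reduction described in the abstract above rests on a standard property: multiplying a circulant matrix by a vector is a circular convolution, which can be computed with FFTs. The sketch below illustrates this for a single circulant block using only the Python standard library. It is not the CirCNN implementation; the function names, the simple recursive radix-2 FFT, and the restriction to power-of-two sizes are assumptions made for illustration.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circulant_matvec_fft(c, x):
    """Compute y = C x in O(n log n), where C[i][j] = c[(i - j) % n].

    Only the first column c (n values) is stored, not the n*n matrix.
    """
    n = len(c)
    product = [a * b for a, b in zip(fft(c), fft(x))]
    # The inverse transform here is unnormalized, so divide by n.
    return [v.real / n for v in fft(product, inverse=True)]


def circulant_matvec_dense(c, x):
    """Reference O(n^2) product using the same circulant definition."""
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


c = [1.0, 2.0, 3.0, 4.0]   # first column of a 4x4 circulant block
x = [0.5, -1.0, 2.0, 0.0]
fast = circulant_matvec_fft(c, x)
slow = circulant_matvec_dense(c, x)
assert all(abs(a - b) < 1e-9 for a, b in zip(fast, slow))
```

A block-circulant weight matrix would apply this routine to each block and sum the partial results per output block, so storage grows with the number of blocks times the block size rather than with the full matrix dimensions.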
