ConvNets, or Convolutional Neural Networks (CNNs), are state-of-the-art classification algorithms, achieving near-human performance in visual recognition [1]. New trends such as augmented reality demand always-on visual processing in wearable devices. Yet, advanced ConvNets achieving high recognition rates are too expensive in terms of energy, as they require substantial data movement and billions of convolution computations. Today, state-of-the-art mobile GPUs and ConvNet accelerator ASICs [2][3] only demonstrate energy efficiencies of tens to several hundreds of GOPS/W, one order of magnitude below the requirements for always-on applications.

This paper introduces the concept of hierarchical recognition processing, combined with the Envision platform: an energy-scalable ConvNet processor achieving efficiencies up to 10TOPS/W, while maintaining recognition rate and throughput. Envision hereby enables always-on visual recognition in wearable devices.

Figure 14.5.1 demonstrates the concept of hierarchical recognition. Here, a hierarchy of increasingly complex, individually trained ConvNets, with different topologies, different network sizes and increasing computational precision requirements, is used in the context of person identification. This enables constant scanning for faces at very low average energy cost, yet rapidly scales up to more complex networks detecting a specific face such as the device owner's, all the way up to full VGG-16-based 5760-face recognition. The opportunities afforded by such a hierarchical approach span far beyond face recognition alone, but can only be exploited by digital systems demonstrating wide-range energy scalability across computational precision. The state-of-the-art ASICs in [3] and [4] only show 1.5× and 8.2× energy-efficiency scalability, respectively. Envision improves upon this by introducing subword-parallel Dynamic-Voltage-Accuracy-Frequency Scaling (DVAFS), a circuit-level technique enabling 40× energy-precision scalability at constant throughput.

Figure 14.5.2 illustrates the basic principle of DVAFS and compares it to Dynamic-Accuracy Scaling (DAS) and Dynamic-Voltage-Accuracy Scaling (DVAS) [4]. In DAS, switching activity, and hence energy consumption, is reduced for low-precision computations by rounding and masking a configurable number of LSBs at the inputs of multiply-accumulate (MAC) units. DVAS exploits the shorter critical paths of DAS's reduced-precision modes by combining them with voltage scaling for increased energy scalability. This paper proposes subword-parallel DVAFS, which further improves upon DVAS by reusing arithmetic cells that are inactive at reduced precision. These can be reconfigured to compute 2×1-8b or 4×1-4b (N×1-16b/N, with N the level of subword-parallelism) rather than 1×1-16b words per cycle when operating at 8b precision or below. At constant data throughput, this permits lowering the processor's frequency and voltage significantly below DVAS values. As a result, DVAFS is a dynamic precision technique which simultaneously lowers all run-time adjustable contributors to power consumption: switching activity, frequency and supply voltage.
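The mechanism behind DVAFS can be illustrated with a minimal Python sketch of the dynamic-power relation P ∝ αCfV². The scaling relations below (a linear voltage-frequency curve, activity proportional to word length) are illustrative assumptions, not measurements from the Envision chip; the sketch only shows how subword parallelism lets activity, frequency and voltage drop together at constant throughput.

```python
# Minimal sketch of DVAFS energy scaling at constant throughput.
# The V-f curve and activity model are illustrative assumptions.

def energy_per_op(precision_bits):
    """Relative energy per MAC when a 16b datapath computes N = 16/precision
    subwords per cycle and trades the surplus throughput for lower f and V."""
    n = 16 // precision_bits        # subword parallelism: 1 (16b), 2 (8b), 4 (4b)
    alpha = precision_bits / 16.0   # switching activity: fewer bits toggle
    f = 1.0 / n                     # N results/cycle -> run at f/N for same rate
    v = 0.5 + 0.5 * f               # assumed linear voltage-frequency relation
    power = alpha * f * v ** 2      # P ~ alpha * C * f * V^2 (C normalized out)
    return power / (n * f)          # throughput n*f is constant by construction

for bits in (16, 8, 4):
    print(f"{bits:2d}b: {energy_per_op(bits) / energy_per_op(16):.3f}x energy/op")
```

Even with these toy constants, the 16b-to-4b energy ratio comes out at the same order of magnitude as the 40× scalability reported above, because activity, frequency and voltage all contribute multiplicatively.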
A low-power precision-scalable processor for ConvNets or convolutional neural networks (CNNs) is implemented in a 40nm technology. Its 256 parallel processing units achieve a peak 102GOPS running at 204MHz. To minimize energy consumption while maintaining throughput, this work is the first to both exploit the sparsity of convolutions and implement dynamic precision scalability enabling supply-voltage and energy scaling. The processor is fully C-programmable, consumes 25-288mW at 204MHz and scales efficiency from 0.3-2.6 real TOPS/W. This system hereby outperforms the state-of-the-art by up to 3.9× in energy efficiency.

Introduction

Recently, CNNs (Fig. 1) have emerged as state-of-the-art classification algorithms, achieving near-human performance in speech recognition and visual detection [1-3]. However, they are typically very expensive in terms of energy consumption. In [4], an algorithm-level study, we demonstrated opportunities for drastic energy reductions in CNNs through dynamic word-length scaling and sparse guarding. Precision requirements vary across CNNs and even across CNN layers, as the necessary number of bits can go down from 16 to 5 or even 1 bit for different benchmarks, with less than 1% accuracy loss (Tab. 1). Origami [5], Nvidia Tegra [6], and Eyeriss [7] offer non-optimal embedded solutions, as they keep computational precision constant and do not adapt to varying requirements. This work is the first to exploit these opportunities and to implement them in a state-of-the-art CNN architecture. It optimizes energy consumption for any CNN with any precision requirement up to 16-bit fixed point, without sacrificing flexibility, programmability, accuracy or throughput. We hereby enable low-power, high-performance embedded applications of computer vision.

Low Power CNN Processor Design

This CNN processor achieves scalable low-power operation through three key innovations: (A) a 2D-SIMD MAC array with shifted inputs, (B) dynamic precision and voltage scaling, and (C) guarded data fetches and operations. Figure 2 shows the high-level processor overview. It contains a precision-scalable 2D-SIMD array in a voltage-scalable power domain, a total of 148kB of on-chip data, guard and program memory, max-pool and Rectified Linear Unit vector arithmetic, and a DMA with Huffman compression, all in a fixed power domain. The processor has a custom VLIW and SIMD instruction set and is fully programmable in C using dedicated libraries and a custom-generated compiler. The chip is clock-gated and operator-guarded where possible to save dynamic power.

A. The 16×16 2D-SIMD MAC array (Fig. 3) generates 256 intermediate outputs per cycle while consuming only 16+16 inputs. These MACs are single-cycle and contain a 48-bit accumulation register. In an 11×11 convolution example, the MAC array takes in 16 subsequent pixels from a single image channel and 16 filter weights from different filters in the first cycle. In each of the next 10 cycles, 17 words are fetched: 16 filter weights and a single pixel, which is shifted through a shift register so that previously fetched pixels are reused across the array.
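To make the dataflow of the shifted-input MAC array concrete, here is a behavioral Python sketch of one pass of an 11-tap convolution over the 16×16 array. The function name, array shapes and exact scheduling are assumptions for illustration; only the fetch pattern (16 pixels and 16 weights in the first cycle, then 16 weights plus one shifted pixel per cycle) follows the description above.

```python
import numpy as np

ROWS = COLS = 16  # 256 MACs, each with a 48b accumulator in hardware

def conv_pass_11tap(pixels, weights):
    """Behavioral sketch: 16 output positions x 16 filters, one 11-tap pass.

    pixels:  26 input pixels from one image channel (16 + 10 for 11 taps)
    weights: (11, 16) array, column j holding the 11 taps of filter j
    """
    acc = np.zeros((ROWS, COLS), dtype=np.int64)
    regs = list(pixels[:ROWS])                 # cycle 0: fetch 16 pixels at once
    for tap in range(weights.shape[0]):        # 11 cycles for an 11-wide kernel
        w = weights[tap]                       # fetch 16 weights, one per filter
        for i in range(ROWS):                  # all 256 MACs fire in parallel
            for j in range(COLS):
                acc[i, j] += regs[i] * w[j]
        if tap + 1 < weights.shape[0]:         # next cycle: fetch a single new
            regs = regs[1:] + [pixels[ROWS + tap]]  # pixel, shift the others
    return acc                                 # acc[i, j]: output i of filter j
```

After the first cycle, only 17 words are fetched per cycle (16 weights and 1 pixel) while all 256 MACs stay busy, which is exactly the data reuse the shifted-input organization is meant to provide.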
This work targets the automated minimum-energy optimization of Quantized Neural Networks (QNNs): networks using low-precision weights and activations. These networks are trained from scratch at an arbitrary fixed-point precision. At iso-accuracy, QNNs using fewer bits require deeper and wider network architectures than networks using higher-precision operators, while they require less complex arithmetic and fewer bits per weight. This fundamental trade-off is analyzed and quantified to find the minimum-energy QNN for any benchmark and hence optimize energy efficiency. To this end, the energy consumption of inference is modeled for a generic hardware platform. This allows drawing several conclusions across different benchmarks. First, energy consumption varies by orders of magnitude at iso-accuracy depending on the number of bits used in the QNN. Second, in a typical system, BinaryNets or int4 implementations lead to the minimum-energy solution, outperforming int8 networks by up to 2-10× at iso-accuracy. All code used for QNN training is available from https://github.com/BertMoons/.
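As a rough illustration of the trade-off this abstract quantifies, the toy Python model below searches for a minimum-energy precision. Its scaling laws (iso-accuracy network size growing as precision shrinks; per-MAC arithmetic and weight-fetch energy growing with bit width) are loose assumptions for illustration only, not the paper's calibrated hardware model.

```python
# Toy minimum-energy QNN search. The exponents below are illustrative
# assumptions, not the calibrated energy model from the paper.

def energy_per_inference(bits, base_macs=1e8):
    macs = base_macs * (16 / bits) ** 0.5  # assumed: low-bit nets need wider/deeper topologies
    e_mac = (bits / 16) ** 1.25            # assumed: multiplier energy shrinks super-linearly
    e_mem = bits / 16                      # weight-fetch energy scales with word length
    return macs * (e_mac + e_mem)          # relative energy units

candidates = (1, 2, 4, 8, 16)
for b in candidates:
    print(f"int{b:<2}: {energy_per_inference(b) / energy_per_inference(16):.2f}x vs int16")
best = min(candidates, key=energy_per_inference)
print("minimum-energy precision under these assumptions:", best, "bits")
```

Under these toy numbers the optimum lands at the low-precision end, consistent with the observation above that BinaryNet or int4 implementations can beat int8 at iso-accuracy; with different assumed exponents the optimum shifts, which is why the paper models energy per benchmark rather than assuming one answer.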