Proceedings of the 17th ACM International Conference on Computing Frontiers 2020
DOI: 10.1145/3387902.3394038

Enabling mixed-precision quantized neural networks in extreme-edge devices

Abstract: The deployment of Quantized Neural Networks (QNN) on advanced microcontrollers requires optimized software to exploit digital signal processing (DSP) extensions of modern instruction set architectures (ISA). As such, recent research proposed optimized libraries for QNNs (from 8-bit to 2-bit) such as CMSIS-NN and PULP-NN. This work presents an extension to the PULP-NN library targeting the acceleration of mixed-precision Deep Neural Networks, an emerging paradigm able to significantly shrink the memory footprin…
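To make the mixed-precision idea in the abstract concrete, the sketch below shows a dot product with 8-bit activations and 4-bit weights packed two per byte. This is an illustrative Python model, not the actual C kernels of PULP-NN or CMSIS-NN; the function names are hypothetical.

```python
def pack_int4(values):
    """Pack signed 4-bit values (range -8..7) two per byte, low nibble first."""
    packed = bytearray()
    for i in range(0, len(values), 2):
        lo = values[i] & 0xF
        hi = (values[i + 1] & 0xF) if i + 1 < len(values) else 0
        packed.append(lo | (hi << 4))
    return bytes(packed)

def unpack_int4(packed, count):
    """Unpack to signed ints, sign-extending each nibble."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]

def mixed_precision_dot(acts_int8, weights_packed, count):
    """8-bit activations x 4-bit weights, wide (Python int) accumulator."""
    weights = unpack_int4(weights_packed, count)
    return sum(a * w for a, w in zip(acts_int8, weights))

# 4-bit weights halve the weight storage relative to int8:
w = [3, -2, 7, -8]
packed = pack_int4(w)           # 2 bytes instead of 4
acts = [10, 20, 30, 40]
print(mixed_precision_dot(acts, packed, 4))  # -> -120
```

The point of the paper's library extension is to perform the unpack and multiply-accumulate steps with DSP ISA instructions rather than element-by-element as modeled here.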


Cited by 23 publications (28 citation statements)
References 6 publications
“…By selecting the position of the fixed point, standard base-2 LNS can represent implicit bases such as 2^(1/8), 2^(1/4), 2^(1/2), 2, 4, 8, and so on. Remarkably, despite a very extensive literature on LNS over several decades, we have been unable to find a statement of this simple identity in the literature.…”
Section: Bases and Base Aliasing in LNS
confidence: 99%
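The base-aliasing identity this excerpt describes can be checked numerically: decoding a base-2 LNS code with f fractional bits, 2^(code / 2^f), is the same as raising the implicit base 2^(1/2^f) to the integer code. A small sketch (my own illustration, not from the cited paper):

```python
import math

def lns_decode(code, frac_bits):
    """Decode a base-2 LNS code with `frac_bits` fractional bits:
    value = 2 ** (code / 2**frac_bits)."""
    return 2.0 ** (code / (1 << frac_bits))

def implicit_base(frac_bits):
    """The base the integer code is implicitly raised to."""
    return 2.0 ** (1.0 / (1 << frac_bits))

# frac_bits = 3 gives the eighth root of 2 as the implicit base
assert math.isclose(implicit_base(3), 2 ** (1 / 8))
# Fixed-point decoding and the implicit-base reading agree
code = 11
assert math.isclose(lns_decode(code, 3), implicit_base(3) ** code)
```

Shifting the fixed point the other way (negative fractional bits) yields the integer bases 2, 4, 8 mentioned in the quote.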
“…Smaller word sizes can reduce the memory footprint of data and the complexity of arithmetic for a variety of number systems [9,14,17,33], which will have a significant impact on the ability to deploy systems in resource-constrained embedded devices at the edge of networks. Further, fixed-point number systems with very short word lengths have been proposed in the literature for a variety of signal processing applications [2,24], while shorter floating- and fixed-point numbers have also been used for neural networks [8,18,19,39,43,44].…”
Section: Introduction
confidence: 99%
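The footprint reduction from shorter word sizes is simple arithmetic; a minimal sketch (hypothetical helper, illustrative tensor shape):

```python
def tensor_bytes(num_elems, bits):
    """Bytes needed to store num_elems values at `bits` bits each,
    densely packed and rounded up to whole bytes."""
    return (num_elems * bits + 7) // 8

# A 3x3x64x64 convolution weight tensor at different precisions:
elems = 3 * 3 * 64 * 64
print(tensor_bytes(elems, 8))  # 36864 bytes at int8
print(tensor_bytes(elems, 4))  # 18432 bytes at int4
print(tensor_bytes(elems, 2))  # 9216 bytes at int2
```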
“…However, the extensions proposed in Garofalo et al. [10] only tackle part of the challenge, lacking support for mixed-precision operations. Mixed-precision execution requires data-conversion and packing/unpacking operations, leading to significant overheads if not natively supported by the underlying hardware [11]. When applied to DNNs, exploiting mixed-precision computations on state-of-the-art processors dramatically reduces the memory footprint, enabling the execution of MobileNets on tiny end-nodes.…”
Section: Introduction
confidence: 99%
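The packing/unpacking overhead this excerpt mentions can be made visible with 2-bit fields: each extraction in plain software takes a shift, a mask, and a sign-extend per element, steps that dedicated ISA extensions can fold into one instruction. A hedged Python model (function names are my own):

```python
def pack_int2(values):
    """Pack signed 2-bit values (-2..1) into one word, 2 bits each,
    lowest index in the least-significant bits."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0x3) << (2 * i)
    return word

def extract_int2(word, idx):
    """Pull out the idx-th signed 2-bit field: shift, mask,
    sign-extend -- three explicit operations per element."""
    field = (word >> (2 * idx)) & 0x3
    return field - 4 if field >= 2 else field

w = pack_int2([1, -2, 0, -1])
print([extract_int2(w, i) for i in range(4)])  # -> [1, -2, 0, -1]
```

Without native support, a kernel pays this per-element cost on every load, which is the overhead [11] quantifies.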
“…To alleviate the poor-performance problems, a number of studies have been undertaken to accelerate DNN implementations by designing hardware-accelerated intelligent computing architectures for sensing systems. Some studies exploit the properties of DNNs to reduce latency by using the parallelism of specialized acceleration circuit designs, such as [8][9][10][11][12][13][14]. Yet these works ignore that the overall power consumption exceeds the budget.…”
Section: Introduction
confidence: 99%