2020
DOI: 10.1007/978-3-030-66770-2_22
Leveraging Automated Mixed-Low-Precision Quantization for Tiny Edge Microcontrollers

Cited by 22 publications (16 citation statements)
References 9 publications
“…However, uniformly quantizing a model to ultra low-precision can cause significant accuracy degradation. It is possible to address this with mixed-precision quantization [51,80,100,180,191,201,226,233,236,250,273]. In this approach, each layer is quantized with different bit precision, as illustrated in Figure 8.…”
Section: B. Mixed-Precision Quantization
confidence: 99%
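The excerpt above describes assigning a different bit precision to each layer. A minimal sketch of that idea, assuming simple per-tensor symmetric quantization; the layer names, shapes, and bit-width choices are illustrative assumptions, not the paper's actual configuration:

import numpy as np

def quantize_symmetric(w, bits):
    """Uniform symmetric quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit, 127 for 8-bit
    scale = np.abs(w).max() / qmax        # per-tensor scale factor
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q.astype(np.int8), scale       # integer codes plus scale for dequantization

# Hypothetical network: each layer gets its own precision.
layers = {
    "conv1": (np.random.randn(16, 3, 3, 3), 8),   # first layer kept at 8-bit
    "conv2": (np.random.randn(32, 16, 3, 3), 4),  # middle layer at 4-bit
    "fc":    (np.random.randn(10, 128), 2),       # last layer at 2-bit
}

for name, (w, bits) in layers.items():
    q, scale = quantize_symmetric(w, bits)
    err = np.abs(w - q * scale).mean()
    print(f"{name}: {bits}-bit, mean abs quantization error {err:.4f}")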
“…Moreover, support for INT4 quantization was only recently added to TVM [254]. Recent work has shown that low-precision and mixed-precision quantization with INT4/INT8 works in practice [51,80,100,106,180,191,201,226,233,236,250,254,273]. Thus, developing efficient software APIs for lower precision quantization will have an important impact.…”
Section: Future Directions for Research in Quantization
confidence: 99%
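The excerpt notes that precisions below 8 bits need explicit software support. A hedged illustration of one reason why: signed 4-bit values have no native container, so two are packed per byte and must be unpacked before use. This is a generic nibble-packing scheme, not TVM's actual INT4 code path:

import numpy as np

def pack_int4(values):
    """Pack an even-length array of ints in [-8, 7] into bytes (two per byte)."""
    v = np.asarray(values, dtype=np.int8) & 0x0F          # two's-complement nibbles
    return (v[0::2] | (v[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Recover signed 4-bit values from packed bytes."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo)                     # sign-extend low nibbles
    hi = np.where(hi > 7, hi - 16, hi)                     # sign-extend high nibbles
    return np.stack([lo, hi], axis=1).reshape(-1)

w = np.array([-8, 7, 3, -2], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(w)), w)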
“…Consequently, it is highly suggested to develop a new federated learning scheme to take into consideration the trade-off between utilization of communication resources and convergence rate throughout the training phase. Numerous quantization methods might be applied for federated learning across IoT networks including hyper-sphere quantization [78], low precision quantizer [79,80], and universal vector quantization [81].…”
Section: Sparsification-Empowered Federated Learning
confidence: 99%
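As a rough illustration of the low-precision quantizer idea mentioned for federated learning, the sketch below stochastically quantizes a client update to a few bits before upload and dequantizes it on the server. The function names, the 4-bit setting, and the two-client averaging are assumptions for illustration only:

import numpy as np

def quantize_update(update, bits=4, rng=np.random.default_rng(0)):
    """Stochastically round a client update to 2**bits uniform levels."""
    levels = 2 ** bits - 1
    lo, hi = update.min(), update.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    x = (update - lo) / scale                                     # map to [0, levels]
    q = np.floor(x) + (rng.random(x.shape) < (x - np.floor(x)))   # unbiased stochastic rounding
    return q.astype(np.uint8), lo, scale

def dequantize_update(q, lo, scale):
    """Server-side reconstruction of the compressed update."""
    return q.astype(np.float32) * scale + lo

# Toy example: two clients upload quantized updates, the server averages them.
clients = [np.random.randn(1000).astype(np.float32) for _ in range(2)]
decoded = [dequantize_update(*quantize_update(u)) for u in clients]
avg = np.mean(decoded, axis=0)
print("mean abs averaging error:", np.mean(np.abs(avg - np.mean(clients, axis=0))))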
“…Similarly, MCUNet (Lin et al., 2020) uses evolutionary search to design NNs for larger MCUs (2MB eFlash / 512KB SRAM) and larger datasets including visual wake words (VWW) (Chowdhery et al., 2019) and keyword spotting (KWS) (Warden, 2018). Reinforcement learning (RL) has also been used to choose quantization options in order to help fit an ImageNet model onto a larger MCU (2MB eFlash) (Rusci et al., 2020b). As well as images, audio tasks are an important driver for TinyML.…”
Section: Machine Learning
confidence: 99%
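Whichever search strategy chooses the quantization options (evolutionary or RL), the resulting per-layer assignment has to fit the MCU's memory budget. A toy feasibility check under an assumed 2 MB eFlash budget; the layer sizes and candidate bit-widths are made up for illustration:

FLASH_BUDGET_BYTES = 2 * 1024 * 1024          # assumed 2 MB eFlash budget

layer_params = {"conv1": 432, "conv2": 1_200_000, "fc": 512_000}   # parameter counts (illustrative)
bit_assignment = {"conv1": 8, "conv2": 4, "fc": 4}                 # candidate per-layer precisions

def flash_footprint(params, bits):
    """Bytes needed to store each layer's weights at its assigned precision."""
    return sum(params[name] * bits[name] // 8 for name in params)

footprint = flash_footprint(layer_params, bit_assignment)
print(f"{footprint} bytes, fits budget: {footprint <= FLASH_BUDGET_BYTES}")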
“…When compared to the sub-byte kernels of CMix-NN (Capotondi et al., 2020), our 4-bit kernels can substantially hide the latency overhead due to software emulation of 4-bit operations, by fully exploiting the available instruction-level parallelism (ILP) bandwidth on Cortex-M microcontrollers. Furthermore, we believe the accuracy of the 4-bit KWS MicroNet can be further improved by selectively quantizing lightweight depthwise layers to 8 bits, while quantizing the remaining memory- and latency-heavy pointwise and standard convolutional layers to 4 bits (Rusci et al., 2020a; Gope et al., 2020).…”
Section: Keyword Spotting (KWS)
confidence: 99%
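The selective policy described in this excerpt (8-bit depthwise layers, 4-bit pointwise and standard convolutions) can be written down directly as a type-based bit-width assignment; the layer names and kinds below are illustrative, not taken from the cited models:

def assign_bits(layers):
    """Map each layer to a bit-width based on its type: depthwise stays at 8, the rest go to 4."""
    return {
        name: 8 if kind == "depthwise" else 4
        for name, kind in layers.items()
    }

model = {
    "conv_stem": "standard",     # 3x3 standard convolution, quantized to 4-bit
    "block1_dw": "depthwise",    # lightweight depthwise layer, kept at 8-bit
    "block1_pw": "pointwise",    # 1x1 pointwise layer, quantized to 4-bit
}
print(assign_bits(model))        # {'conv_stem': 4, 'block1_dw': 8, 'block1_pw': 4}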