Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems 2022
DOI: 10.1145/3560905.3568300
Ultra-Low Power DNN Accelerators for IoT

Cited by 9 publications (6 citation statements)
References 16 publications
“…Giordano et al [47] benchmark a single architecture for image classification on several different platforms. Moss et al [48] evaluate different image classification architectures on a single platform, MAX78000. Unlike these works, we describe the full deployment pipeline in the context of object detection, from architecture exploration to quantization and hardware-optimized implementation.…”
Section: Related Work
confidence: 99%
“…However, when similar models are executed on such an MCU integrating multiple convolutional engines, we can expect a latency baseline analogous to that of the Google Coral, although lower in magnitude. Recent works show that the MAX78000 base execution units typically require two-dimensional inputs and that, when fewer computing resources are needed, the data is zero-padded, resulting in a latency baseline largely independent of the operation size (for instance, a network with 4 input channels, a 3 × 3 kernel, padding of one, and 4 to 64 output channels yields a constant baseline latency of ∼150 µs, far larger than SPLVP) [18]. The MLP models investigated in this work are small (the largest Avila model has 270 neurons).…”
Section: Other Accelerators
confidence: 99%
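The constant-baseline behavior described in the statement above can be made concrete with a minimal sketch, assuming a PyTorch-style sweep of the configuration it quotes (4 input channels, 3 × 3 kernel, padding of one, 4 to 64 output channels). The ∼150 µs figure and the zero-padding explanation come from the quoted text [18]; the function and constant names below are hypothetical placeholders, not measurements.

```python
import torch
import torch.nn as nn

# Sketch of the sweep described above: a single Conv2d with 4 input channels,
# a 3x3 kernel, padding of 1, and 4..64 output channels. The quoted statement
# reports a roughly constant ~150 us latency on the MAX78000 CNN engine for
# all of these configurations, because unused compute lanes are zero-padded.
BASELINE_LATENCY_US = 150.0  # figure quoted from [18]; placeholder, not measured here


def estimated_max78000_latency_us(layer: nn.Conv2d) -> float:
    """Toy latency model: the accelerator pays the same baseline cost
    regardless of how many output channels the layer actually uses."""
    return BASELINE_LATENCY_US


if __name__ == "__main__":
    x = torch.zeros(1, 4, 16, 16)  # 4-channel, 16x16 input as in the example
    for out_ch in (4, 16, 64):
        conv = nn.Conv2d(in_channels=4, out_channels=out_ch,
                         kernel_size=3, padding=1)
        y = conv(x)
        print(out_ch, tuple(y.shape), estimated_max78000_latency_us(conv))
```

The point of the toy model is only that, on an accelerator whose idle lanes are padded with zeros, widening the layer within the engine's native width does not change the baseline latency.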
“…In general, complex accelerators designed to support large Convolutional Neural Networks (CNNs) incur substantial latency for very small models because the internal logic is typically underutilized; moreover, they require interfacing with mid-range processors and operating systems. For instance, on the MAX78000 SoC, convolutional and linear filters with a 16 × 16 input exhibit a latency of ∼75 µs irrespective of filter size, while a single two-dimensional convolutional layer with four output channels requires ∼150 µs irrespective of the number of input channels [18]. Similarly, the Google Edge TPU suffers from extreme underutilization of its Processing Elements (PEs) and inefficient sequential scheduling of Fully Connected (FC) layers [19].…”
Section: Introduction
confidence: 99%
“…The generated attention map, AttMap^l, is fed into a classifier along with the feature map of the current video frame, F^l_{i+j}, in order to detect any semantic variation. The classifier component, represented by the function Z_Class and parameterized by θ_Class in equation (7), has two outputs. It generates a class label from the attention map AttMap^l and the feature map of the current video frame F^l_{i+j}.…”
Section: Temporal Early Exit Module (TEEM)
confidence: 99%
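Below is a minimal sketch of the two-output classifier Z_Class described in the statement above, assuming, purely for illustration, that AttMap^l and F^l_{i+j} are fused by channel-wise concatenation and reduced to two logits; the channel counts, pooling head, and module name are hypothetical and are not taken from the cited paper.

```python
import torch
import torch.nn as nn


class SemanticVariationClassifier(nn.Module):
    """Sketch of Z_Class: consumes the attention map AttMap^l and the feature
    map F^l_{i+j} of the current frame and emits two logits (semantic variation
    vs. no variation). Channel counts and the pooling head are illustrative."""

    def __init__(self, att_channels: int = 64, feat_channels: int = 64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(att_channels + feat_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, 2),  # two outputs, as stated in the quoted text
        )

    def forward(self, att_map: torch.Tensor, feat_map: torch.Tensor) -> torch.Tensor:
        # Fuse the attention map with the current frame's feature map
        # (concatenation is an assumption made for this sketch).
        return self.head(torch.cat([att_map, feat_map], dim=1))


if __name__ == "__main__":
    clf = SemanticVariationClassifier()
    att = torch.randn(1, 64, 14, 14)
    feat = torch.randn(1, 64, 14, 14)
    print(clf(att, feat).shape)  # torch.Size([1, 2])
```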
“…Object detection in static images has achieved remarkable success in recent years using CNNs [3]. Beyond individual images, however, video object detection has emerged as a new challenge, particularly when deployed on embedded devices with limited computation and energy resources [4], [5], [6], [7]. This is due to the high computational cost of applying existing image object detection networks in real time to numerous individual video frames [8].…”
Section: Introduction
confidence: 99%