2021
DOI: 10.1109/tc.2021.3066883
DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs

Abstract: The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme edge of the Internet-of-Things is a critical enabler to support pervasive Deep Learning-enhanced applications. Low-cost MCU-based end-nodes have limited on-chip memory and often replace caches with scratchpads to reduce area overheads and increase energy efficiency, requiring explicit DMA-based memory transfers between different levels of the memory hierarchy. Mapping modern DNNs on these systems requires aggressive topology-dependent t…
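The abstract's point about scratchpads and explicit DMA transfers can be illustrated with a minimal sketch. All names and the tile size below are hypothetical, and DORY's real tiling is topology-dependent and double-buffered; this only shows the basic pattern of moving one tile at a time between a large "L2" memory and a small "L1" scratchpad:

```python
# Illustrative sketch (not DORY's actual code): tiled layer execution on a
# scratchpad-based MCU. The large "L2" buffer holds the full activation
# tensor; only one tile at a time fits in the small "L1" scratchpad, so a
# DMA engine (simulated here with slice copies) moves tiles in and out.

TILE = 64  # hypothetical L1 tile size, in elements

def dma_copy(dst, dst_off, src, src_off, n):
    """Stand-in for an asynchronous DMA transfer (here: a blocking copy)."""
    dst[dst_off:dst_off + n] = src[src_off:src_off + n]

def relu_tile(buf, n):
    """Compute kernel operating only on the L1-resident tile."""
    for i in range(n):
        if buf[i] < 0:
            buf[i] = 0

def run_layer(l2_activations):
    l1 = [0] * TILE                   # scratchpad buffer
    out = [0] * len(l2_activations)
    for off in range(0, len(l2_activations), TILE):
        n = min(TILE, len(l2_activations) - off)
        dma_copy(l1, 0, l2_activations, off, n)   # L2 -> L1
        relu_tile(l1, n)                          # compute on the tile
        dma_copy(out, off, l1, 0, n)              # L1 -> L2
    return out
```

Real deployment flows double-buffer the tiles so the next DMA transfer overlaps with the current tile's computation; this single-buffer version only shows the tiling and the explicit data movement.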

Cited by 94 publications (84 citation statements)
References 40 publications
“…
Vehicle class | Size [cm] : Weight [kg] | Power [W] | Onboard device
standard-size [4] | ∼50 : ≥1 | ≥100 | Desktop
micro-size [5] | ∼25 : ∼0.5 | ∼50 | Embedded
nano-size [14] | ∼10 : ∼0.01 | ∼5 | MCU
pico-size [13] | ∼2 : ≤0.001 | ∼0.1 | ULP
…both the strict power budget of IoT MCUs and the real-time requirement of autonomous nano-drones;
• we present our dataset augmentation methodology, which maximizes the model's generalization capability with synthetic pitch, photometric, optical, and geometric enhancements;
• using open-source tools [19], [20], we demonstrate our methodology from perception to control (including training, aggressive 8-bit quantization, CNN deployment, and low-level controller), with no drop in regression performance even compared to the full-precision (float 32-bit) Proximity CNN. We achieve an onboard peak inference performance of 135 frame/s within 86 mW and a top energy efficiency of ∼0.43 mJ/frame;
• we experimentally evaluate how the CNN design impacts i) regression performance, ii) power consumption, iii) inference rate, and iv) closed-loop control accuracy;
• we prove our methodology in the field, presenting a closed-loop, fully working demonstration of PULP-Frontnet on a 27-gram nano-UAV, achieving a 100% success rate on all tests (18 runs on never-seen-before subjects), with behavior comparable to an ideal motion-capture system (median absolute angular error below 5°);…”
Section: Vehicle Class
confidence: 99%
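As a quick sanity check on the figures quoted above, energy per frame is simply power divided by throughput. At 86 mW and 135 frame/s this works out to ≈0.64 mJ/frame, which suggests the quoted ∼0.43 mJ/frame top efficiency was measured at a different, more efficient operating point (an inference on my part, not stated in the quote):

```python
# Energy per frame [mJ] = power [mW] / throughput [frame/s]
power_mw = 86.0
throughput_fps = 135.0
energy_mj_per_frame = power_mw / throughput_fps
print(round(energy_mj_per_frame, 2))  # -> 0.64
```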
“…• making use of open-source quantization/deployment tools [19], [20], as well as employing a 2× more aggressive quantization scheme (i.e., 8-bit vs. 16-bit);
• including the development flow for ad-hoc dataset collection and its augmentation;
• proposing a novel streamlined DL model (up to 10× fewer operations and 8× less memory);
• introducing a thorough model-size analysis to study the relation between power consumption, memory constraints, regression performance, and control accuracy. Ultimately, our models push the onboard NN inference performance further, with a peak throughput of 135 frame/s @ 86 mW, whereas PULP-Dronet peaked at 18 frame/s @ 272 mW.…”
Section: Related Work
confidence: 99%
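The "aggressive 8-bit quantization" mentioned in the quotes above can be sketched under the simplifying assumption of symmetric per-tensor quantization; the cited papers' actual schemes may differ (e.g., per-channel scales or fused requantization):

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: a single shared scale maps
    the float range [-max|x|, +max|x|] onto [-127, 127].
    Returns (quantized values, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error is at most scale/2 per element."""
    return [v * scale for v in q]
```

For example, `quantize_int8([-1.0, 0.5, 1.0])` yields a scale of 1/127 and int8 values near `[-127, 64, 127]`, and dequantizing reconstructs each input to within half a quantization step.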
“…Unlike the research of [21, 23], some studies focus on optimizing the software deep learning algorithm to fit existing embedded system-on-chip (SoC) platforms [25, 26, 27]. Adapting a deep learning algorithm with acceptable performance to an embedded SoC is extremely hard because of the critically limited memory and storage resources compared to cloud-AI or mobile-AI devices [25, 26, 27]. To address these problems, the authors of [25, 27] proposed frameworks for optimized neural network generation.…”
Section: Related Work
confidence: 99%
“…Adapting a deep learning algorithm with acceptable performance to an embedded SoC is extremely hard because of the critically limited memory and storage resources compared to cloud-AI or mobile-AI devices [25, 26, 27]. To address these problems, the authors of [25, 27] proposed frameworks for optimized neural network generation. Both frameworks provide quantization of floating-point arithmetic to integer arithmetic and apply memory constraints to scale the neural network for each device.…”
Section: Related Work
confidence: 99%
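The "memory constraints to scale the neural network" idea in the quote above amounts to checking each layer's working set against the device budget. A back-of-the-envelope sketch follows; the layer shapes and the 512 KiB budget are made up for illustration, not taken from the cited works:

```python
def layer_footprint_bytes(in_elems, out_elems, weight_elems, bytes_per_elem=1):
    """Memory a tiled executor must budget for one layer: input and output
    activations plus weights, at int8 (1 byte/element) by default."""
    return (in_elems + out_elems + weight_elems) * bytes_per_elem

def fits_on_device(layers, budget_bytes):
    """True if every layer's working set fits within the on-chip budget."""
    return all(layer_footprint_bytes(*l) <= budget_bytes for l in layers)

# Hypothetical 3-layer CNN: (input elems, output elems, weight elems)
layers = [
    (32 * 32 * 3, 16 * 16 * 8, 3 * 3 * 3 * 8),
    (16 * 16 * 8, 8 * 8 * 16, 3 * 3 * 8 * 16),
    (8 * 8 * 16, 10, 8 * 8 * 16 * 10),
]
print(fits_on_device(layers, 512 * 1024))  # -> True
```

If a layer fails the check, such frameworks typically shrink the model (fewer channels, lower resolution) or tile the layer so only a slice of it is resident at a time.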