2021
DOI: 10.1109/tc.2021.3066883
DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs

Abstract: The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme edge of the Internet-of-Things is a critical enabler to support pervasive Deep Learning-enhanced applications. Low-cost MCU-based end-nodes have limited on-chip memory and often replace caches with scratchpads to reduce area overheads and increase energy efficiency, requiring explicit DMA-based memory transfers between different levels of the memory hierarchy. Mapping modern DNNs on these systems requires aggressive topology-dependent t…
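The abstract's point about scratchpads and explicit DMA transfers can be illustrated with a minimal sketch. All names and the tile size below are hypothetical, and DORY's real tiling is topology-dependent and double-buffered; this only shows the basic pattern of moving one tile at a time between a large "L2" memory and a small "L1" scratchpad:

```python
# Illustrative sketch (not DORY's actual code): tiled layer execution on a
# scratchpad-based MCU. The large "L2" buffer holds the full activation
# tensor; only one tile at a time fits in the small "L1" scratchpad, so a
# DMA engine (simulated here with slice copies) moves tiles in and out.

TILE = 64  # hypothetical L1 tile size, in elements

def dma_copy(dst, dst_off, src, src_off, n):
    """Stand-in for an asynchronous DMA transfer (here: a blocking copy)."""
    dst[dst_off:dst_off + n] = src[src_off:src_off + n]

def relu_tile(buf, n):
    """Compute kernel operating only on the L1-resident tile."""
    for i in range(n):
        if buf[i] < 0:
            buf[i] = 0

def run_layer(l2_activations):
    l1 = [0] * TILE                   # scratchpad buffer
    out = [0] * len(l2_activations)
    for off in range(0, len(l2_activations), TILE):
        n = min(TILE, len(l2_activations) - off)
        dma_copy(l1, 0, l2_activations, off, n)   # L2 -> L1
        relu_tile(l1, n)                          # compute on the tile
        dma_copy(out, off, l1, 0, n)              # L1 -> L2
    return out
```

Real deployment flows double-buffer the tiles so the next DMA transfer overlaps with the current tile's computation; this single-buffer version only shows the tiling and the explicit data movement.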

Cited by 94 publications (84 citation statements)
References 40 publications
“…
Vehicle class | Size [cm] : Weight [kg] | Power [W] | Onboard device
standard-size [4] | ∼50 : ≥1 | ≥100 | Desktop
micro-size [5] | ∼25 : ∼0.5 | ∼50 | Embedded
nano-size [14] | ∼10 : ∼0.01 | ∼5 | MCU
pico-size [13] | ∼2 : ≤0.001 | ∼0.1 | ULP
…both the strict power budget of IoT MCUs and the real-time requirement of autonomous nano-drones;
• we present our dataset augmentation methodology, which maximizes the model's generalization capability with synthetic pitch, photometric, optical, and geometric enhancements;
• using open-source tools [19], [20], we demonstrate our methodology from perception to control (including training, aggressive 8-bit quantization, CNN deployment, and low-level controller), with no drop in regression performance even compared to the full-precision (float 32-bit) Proximity CNN. We achieve an onboard peak inference performance of 135 frame/s within 86 mW and a top energy efficiency of ∼0.43 mJ/frame;
• we experimentally evaluate how the CNN design impacts i) regression performance, ii) power consumption, iii) inference rate, and iv) closed-loop control accuracy;
• we prove our methodology in the field, presenting a closed-loop, fully working demonstration of PULP-Frontnet on a 27-gram nano-UAV, achieving a 100% success rate on all tests (18 runs on never-seen-before subjects), with behavior comparable to an ideal motion-capture system (median absolute angular error below 5°);…”
Section: Vehicle Class
confidence: 99%
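As a quick sanity check on the figures quoted above, energy per frame is simply power divided by throughput. At 86 mW and 135 frame/s this works out to ≈0.64 mJ/frame, which suggests the quoted ∼0.43 mJ/frame top efficiency was measured at a different, more efficient operating point (an inference on my part, not stated in the quote):

```python
# Energy per frame [mJ] = power [mW] / throughput [frame/s]
power_mw = 86.0
throughput_fps = 135.0
energy_mj_per_frame = power_mw / throughput_fps
print(round(energy_mj_per_frame, 2))  # -> 0.64
```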
“…• making use of open-source quantization/deployment tools [19], [20], as well as employing a 2× more aggressive quantization scheme (i.e., 8-bit vs. 16-bit);
• including the development flow for ad-hoc dataset collection and its augmentation;
• proposing a novel streamlined DL model (up to 10× fewer operations and 8× less memory);
• introducing a thorough model-size analysis to study the relation between power consumption, memory constraints, regression performance, and control accuracy. Ultimately, our models push the onboard NN inference performance further, with a peak throughput of 135 frame/s @ 86 mW, whereas PULP-Dronet peaked at 18 frame/s @ 272 mW.…”
Section: Related Work
confidence: 99%
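The "aggressive 8-bit quantization" mentioned in the quotes above can be sketched under the simplifying assumption of symmetric per-tensor quantization; the cited papers' actual schemes may differ (e.g., per-channel scales or fused requantization):

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: a single shared scale maps
    the float range [-max|x|, +max|x|] onto [-127, 127].
    Returns (quantized values, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error is at most scale/2 per element."""
    return [v * scale for v in q]
```

For example, `quantize_int8([-1.0, 0.5, 1.0])` yields a scale of 1/127 and int8 values near `[-127, 64, 127]`, and dequantizing reconstructs each input to within half a quantization step.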
“…Unlike the research of [21, 23], some studies focus on optimizing the software deep learning algorithm to fit existing embedded system-on-chip (SoC) platforms [25, 26, 27]. Adapting a deep learning algorithm with acceptable performance to an embedded SoC is extremely hard because of the critically limited memory and storage resources compared to cloud-AI or mobile-AI devices [25, 26, 27]. To address these problems, the authors of [25, 27] proposed frameworks for optimized neural network generation.…”
Section: Related Work
confidence: 99%
“…Adapting a deep learning algorithm with acceptable performance to an embedded SoC is extremely hard because of the critically limited memory and storage resources compared to cloud-AI or mobile-AI devices [25, 26, 27]. To address these problems, the authors of [25, 27] proposed frameworks for optimized neural network generation. Both frameworks provide quantization of floating-point arithmetic to integer arithmetic and apply memory constraints to scale the neural network for each device.…”
Section: Related Work
confidence: 99%
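The "memory constraints to scale the neural network" idea in the quote above amounts to checking each layer's working set against the device budget. A back-of-the-envelope sketch follows; the layer shapes and the 512 KiB budget are made up for illustration, not taken from the cited works:

```python
def layer_footprint_bytes(in_elems, out_elems, weight_elems, bytes_per_elem=1):
    """Memory a tiled executor must budget for one layer: input and output
    activations plus weights, at int8 (1 byte/element) by default."""
    return (in_elems + out_elems + weight_elems) * bytes_per_elem

def fits_on_device(layers, budget_bytes):
    """True if every layer's working set fits within the on-chip budget."""
    return all(layer_footprint_bytes(*l) <= budget_bytes for l in layers)

# Hypothetical 3-layer CNN: (input elems, output elems, weight elems)
layers = [
    (32 * 32 * 3, 16 * 16 * 8, 3 * 3 * 3 * 8),
    (16 * 16 * 8, 8 * 8 * 16, 3 * 3 * 8 * 16),
    (8 * 8 * 16, 10, 8 * 8 * 16 * 10),
]
print(fits_on_device(layers, 512 * 1024))  # -> True
```

If a layer fails the check, such frameworks typically shrink the model (fewer channels, lower resolution) or tile the layer so only a slice of it is resident at a time.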