A Survey on the Optimization of Neural Network Accelerators for Micro-AI On-Device Inference

Mazumder, Arnab Neelim; Meng, Jian; Rashid, Hasib-Al; Kallakuri, Utteja; Zhang, Xin; Seo, Jae-sun

doi:10.1109/jetcas.2021.3129415

Cited by 34 publications

(11 citation statements)

References 100 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…( 2) We use ESC-10 dataset [59] for audio anomaly detection and deploy an transformer-based model [60]. (3) We use UCI HAR dataset [42] for motion signal-based human activity recognition and deploy a LSTM-based model. (4) We use MoCap dataset [61] for training a motion signalbased user identification (12 users) model, using a LSTMbased architecture, and deploy it as the inference workload.…”

Section: Implementation and Configurationsmentioning

confidence: 99%

See 1 more Smart Citation

InFi: End-to-End Learning to Filter Input for Resource-Efficiency in Mobile-Centric Inference

Yuan¹,

Zhang²,

He³

et al. 2022

Preprint

View full text Add to dashboard Cite

Mobile-centric AI applications have high requirements for resource-efficiency of model inference. Input filtering is a promising approach to eliminate the redundancy so as to reduce the cost of inference. Previous efforts have tailored effective solutions for many applications, but left two essential questions unanswered: (1) theoretical filterability of an inference workload to guide the application of input filtering techniques, thereby avoiding the trial-and-error cost for resource-constrained mobile applications; (2) robust discriminability of feature embedding to allow input filtering to be widely effective for diverse inference tasks and input content. To answer them, we first formalize the input filtering problem and theoretically compare the hypothesis complexity of inference models and input filters to understand the optimization potential. Then we propose the first end-to-end learnable input filtering framework that covers most state-of-the-art methods and surpasses them in feature embedding with robust discriminability. We design and implement InFi that supports six input modalities and multiple mobile-centric deployments. Comprehensive evaluations confirm our theoretical results and show that InFi outperforms strong baselines in applicability, accuracy, and efficiency. InFi achieve 8.5× throughput and save 95% bandwidth, while keeping over 90% accuracy, for a video analytics application on mobile platforms.

show abstract

Section: Implementation and Configurationsmentioning

confidence: 99%

“…T HE increased computing power of mobile devices and the growing demand for real-time sensor data analytics have created a trend of mobile-centric artificial intelligence (AI) [2], [3], [4], [5]. It is estimated that over 80% of enterprise IoT projects will incorporate AI by 2022.…”

Section: Introductionmentioning

confidence: 99%

InFi: End-to-End Learning to Filter Input for Resource-Efficiency in Mobile-Centric Inference

Yuan¹,

Zhang²,

He³

et al. 2022

Preprint

View full text Add to dashboard Cite

show abstract

“…Moreover, these time-critical applications should often operate on an edge device whose computing power is limited. To port these applications on an edge device, the network compression that optimizes the neural network to satisfy both the time constraints and the performance requirements is essential [ 1 , 2 , 3 , 4 , 5 ]. Network compression consists of network simplification, which simplifies the network architecture, and parameter quantization, which compresses the bit width of parameters to lower than floating point.…”

Section: Introductionmentioning

confidence: 99%

Simplification of Deep Neural Network-Based Object Detector for Real-Time Edge Computing

Choi

Jung

et al. 2023

Sensors

View full text Add to dashboard Cite

This paper presents a method for simplifying and quantizing a deep neural network (DNN)-based object detector to embed it into a real-time edge device. For network simplification, this paper compares five methods for applying channel pruning to a residual block because special care must be taken regarding the number of channels when summing two feature maps. Based on the comparison in terms of detection performance, parameter number, computational complexity, and processing time, this paper discovers the most satisfying method on the edge device. For network quantization, this paper compares post-training quantization (PTQ) and quantization-aware training (QAT) using two datasets with different detection difficulties. This comparison shows that both approaches are recommended in the case of the easy-to-detect dataset, but QAT is preferable in the case of the difficult-to-detect dataset. Through experiments, this paper shows that the proposed method can effectively embed the DNN-based object detector into an edge device equipped with Qualcomm’s QCS605 System-on-Chip (SoC), while achieving a real-time operation with more than 10 frames per second.

show abstract

“…Neural networks require massive computational and memory resources, which make them hard to be deployed in such environments, including lightweight architectures. Thus, optimising DNNs has been a major research topic (Chung and Abdelrahman 2020;Lee et al 2020;Mazumder et al 2021) and has made a series of progress, such as accelerating multi-DNN workloads based on heterogeneous dataflow Kwon et al (2021), compressing DNNs with vectorized weight quantization Gong et al (2021), the delay-aware DNN inference technique Li et al (2021), the error compensation for low-voltage DNN accelerators Ji et al (2021), pruning of redundancy for DNNs ; Ahn and Kim (2022); Camci et al (2022), etc. Although the performance of DNNs has been greatly improved with the support of various strategies including these optimization techniques as well as high-efficiency libraries like oneDNN, there is still much room for further performance improvement.…”

Section: Introductionmentioning

confidence: 99%

An architecture-level analysis on deep learning models for low-impact computations

Wang

Yue

et al. 2022

Artif Intell Rev

View full text Add to dashboard Cite

Deep neural networks (DNNs) have made significant achievements in a wide variety of domains. For the deep learning tasks, multiple excellent hardware platforms provide efficient solutions, including graphics processing units (GPUs), central processing units (CPUs), field programmable gate arrays (FPGAs), and application-specific integrated circuit (ASIC). Nonetheless, CPUs outperform other solutions including GPUs in many cases for the inference workload of DNNs with the support of various techniques, such as the high-performance libraries being the basic building blocks for DNNs. Thus, CPUs have been a preferred choice for DNN inference applications, particularly in the low-latency demand scenarios. However, the DNN inference efficiency remains a critical issue, especially when low latency is required under conditions with limited hardware resources, such as embedded systems. At the same time, the hardware features have not been fully exploited for DNNs and there is much room for improvement. To this end, this paper conducts a series of experiments to make a thorough study for the inference workload of prominent state-of-the-art DNN architectures on a single-instruction-multiple-data (SIMD) CPU platform, as well as with widely applicable scopes for multiple hardware platforms. The study goes into depth in DNNs: the CPU kernel-instruction level performance characteristics of DNNs including branches, branch prediction misses, cache misses, etc, and the underlying convolutional computing mechanism at the SIMD level; The thorough layer-wise time consumption details with potential time-cost bottlenecks; And the exhaustive dynamic activation sparsity with exact details on the redundancy of DNNs. The research provides researchers with comprehensive and insightful details, as well as crucial target areas for optimising and improving the efficiency of DNNs at both the hardware and software levels.

show abstract

A Survey on the Optimization of Neural Network Accelerators for Micro-AI On-Device Inference

Cited by 34 publications

References 100 publications

InFi: End-to-End Learning to Filter Input for Resource-Efficiency in Mobile-Centric Inference

InFi: End-to-End Learning to Filter Input for Resource-Efficiency in Mobile-Centric Inference

Simplification of Deep Neural Network-Based Object Detector for Real-Time Edge Computing

An architecture-level analysis on deep learning models for low-impact computations

Contact Info

Product

Resources

About