2019 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2) 2019
DOI: 10.1109/emc249363.2019.00013
Run-Time Efficient RNN Compression for Inference on Edge Devices

Abstract: Recurrent neural networks can be large and compute-intensive, yet many applications that benefit from RNNs run on small devices with very limited compute and storage capabilities while still having run-time constraints. As a result, there is a need for techniques that achieve significant compression without negatively impacting inference run-time or task accuracy. This paper explores a new compressed RNN cell implementation called Hybrid Matrix Decomposition (HMD) that achieves this dual objective…


Cited by 17 publications (8 citation statements) | References 18 publications
“…Analogously, [53] proposes hybrid network architectures combining binary and full-precision sections to achieve significant energy efficiency and memory compression while guaranteeing performance. Thakker et al. study a compressed RNN cell implementation called Hybrid Matrix Decomposition (HMD) for model inference [54]. It divides the matrix of network weights into two parts: an unconstrained upper half and a lower half composed of rank-1 blocks.…”
Section: A. State of the Art, 1) Model Compression
confidence: 99%
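As a rough illustration of the HMD structure the citing authors describe, the NumPy sketch below splits a weight matrix into a dense upper half and a lower half built from rank-1 blocks. The matrix sizes, block count, and all names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 8, 6          # full weight matrix would be (n_out, n_in)
half = n_out // 2           # split point between dense and rank-1 halves
n_blocks = 2                # lower half partitioned into rank-1 blocks (assumed)
rows_per_block = half // n_blocks

# Unconstrained upper half: stored densely, as in an ordinary layer.
W_upper = rng.standard_normal((half, n_in))

# Lower half: each block of rows is a rank-1 outer product u @ v^T, so a
# (rows_per_block x n_in) block costs rows_per_block + n_in parameters.
us = [rng.standard_normal((rows_per_block, 1)) for _ in range(n_blocks)]
vs = [rng.standard_normal((1, n_in)) for _ in range(n_blocks)]

def hmd_matvec(x):
    """Multiply the compressed matrix by a vector x of shape (n_in,)."""
    top = W_upper @ x
    # Each rank-1 block needs only one dot product (v @ x) and a scaling by u.
    bottom = np.concatenate([(u * (v @ x)).ravel() for u, v in zip(us, vs)])
    return np.concatenate([top, bottom])

x = rng.standard_normal(n_in)
print(hmd_matvec(x).shape)  # (8,) -- same output as the uncompressed matrix
```

The rank-1 lower half trades expressiveness for both storage and multiply count, which is consistent with the run-time goal stated in the abstract.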
“…Tensor factorization is an approach that decomposes a single layer into smaller, more efficient layers [15, 21–25]. Such decomposition creates a smaller network.…”
Section: Tensor Factorization
confidence: 99%
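A minimal sketch of that decomposition, assuming the common truncated-SVD formulation: a dense layer W is replaced by two smaller layers U and V whose product approximates W. The rank r and all names below are illustrative, not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 128, 64, 8
W = rng.standard_normal((m, n))

# Truncated SVD gives the best rank-r approximation in Frobenius norm.
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :r] * s[:r]   # fold singular values into the first factor
V = Vt[:r, :]

x = rng.standard_normal(n)
y_full = W @ x              # original layer: m*n multiplies
y_low  = U @ (V @ x)        # two smaller layers: r*(m+n) multiplies
print(np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full))
```

For r much smaller than min(m, n), the two factored layers store and compute r*(m+n) values instead of m*n, which is why such decomposition "creates a smaller network."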
“…Designing efficient CNNs has been an active, ongoing area of research. In recent years, numerous research efforts have been devoted to compressing neural networks through the use of model pruning [10], quantization [7–9], low-rank tensor decomposition [28–31], compact network architecture design [32], etc.…”
Section: Related Work
confidence: 99%
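To make one of the listed compression families concrete, here is a minimal, hedged sketch of global magnitude pruning in NumPy; the 90% sparsity target and all names are illustrative assumptions, and the cited works cover many more sophisticated variants.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
sparsity = 0.9                                # assumed target: prune 90% of weights

threshold = np.quantile(np.abs(W), sparsity)  # keep the largest 10% by magnitude
mask = np.abs(W) >= threshold
W_pruned = W * mask
print(f"surviving nonzeros: {mask.mean():.2%}")  # ~10% of entries remain
```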