As applications of artificial intelligence grow rapidly, numerous network compression algorithms have been developed for resource-constrained platforms such as smartphones, edge, and IoT devices. Knowledge distillation (KD) transfers soft labels derived from a teacher model to a less parameterized student model, achieving high accuracy with a reduced computational burden. Moreover, online KD enables parallel computing through collaborative learning between the teacher and student networks, thus increasing the training speed. A binarized neural network (BNN) offers an intriguing opportunity for aggressive compression, at the expense of drastically degraded accuracy. In this study, two performance improvements are proposed for online KD when a BNN is used as the student network: 1) parameterized weight clipping (PWC) to reduce dead weights in the student network, and 2) quantization gap-aware adaptive temperature scheduling between the teacher and student networks. In contrast to constant weight clipping (CWC), PWC achieves a 3.78% top-1 test accuracy improvement on the CIFAR-10 dataset by making the clipping threshold trainable and thereby decreasing the gradient mismatch. Furthermore, quantization gap-aware temperature scheduling increases the top-1 test accuracy by 0.08% over online KD at a constant temperature. By combining both methods, the top-1 test accuracy reached 94.60% on the CIFAR-10 dataset, and the accuracy on the Tiny-ImageNet dataset was comparable to that of a 32-bit full-precision network.

INDEX TERMS Neural network compression, knowledge distillation, binarized neural network, parameterized weight clipping, dead weight, adaptive temperature scheduling.
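The abstract names two mechanisms without giving their formulas, so the following is a minimal PyTorch sketch of how they could plausibly be wired together: a binarizing convolution whose clipping scale is a trainable parameter (the name `alpha`, the per-layer scalar granularity, and the straight-through gradient window are all assumptions, not the paper's stated design), plus a standard Hinton-style KD loss whose temperature `T` is supplied externally by a scheduler; the abstract does not specify how the quantization gap maps to `T`, so that mapping is left as an input here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PWCBinarize(torch.autograd.Function):
    """Binarize weights to alpha * sign(w) with a trainable clip scale alpha.
    Backward uses a straight-through estimator whose pass-through window is
    [-alpha, alpha]; weights outside the window receive zero gradient."""
    @staticmethod
    def forward(ctx, w, alpha):
        ctx.save_for_backward(w, alpha)
        return alpha * torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        w, alpha = ctx.saved_tensors
        inside = (w.abs() <= alpha).to(grad_out.dtype)
        grad_w = grad_out * inside                     # STE gated by clip window
        grad_alpha = (grad_out * torch.sign(w)).sum()  # d(alpha*sign(w))/d(alpha)
        return grad_w, grad_alpha

class PWCBinaryConv2d(nn.Conv2d):
    """Conv layer with parameterized weight clipping; alpha is learned per layer."""
    def __init__(self, *args, init_clip=1.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.alpha = nn.Parameter(torch.tensor(init_clip))

    def forward(self, x):
        wb = PWCBinarize.apply(self.weight, self.alpha)
        return F.conv2d(x, wb, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

def kd_loss(student_logits, teacher_logits, labels, T, ce_weight=0.5):
    """Soft-label KD loss at temperature T. Under the paper's scheme, T would be
    rescheduled from a quantization-gap measure each step; here it is an input."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)   # T^2 restores gradient scale
    hard = F.cross_entropy(student_logits, labels)
    return ce_weight * hard + (1.0 - ce_weight) * soft
```

The `T * T` factor is the usual correction that keeps the soft-label gradient magnitude comparable across temperatures; the equal weighting of hard and soft terms is likewise an illustrative default rather than the paper's reported setting.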
The explosive computation and memory requirements of convolutional neural networks (CNNs) hinder their deployment on resource-constrained devices. Because conventional CNNs perform identical parallelized computations even on redundant pixels, the saliency of different features in an image should be exploited for higher energy efficiency and broader market penetration. This paper proposes a novel channel and spatial gating network (CSGN) that adaptively selects vital channels and generates spatial-wise execution masks. CSGN can be characterized as a dynamic, channel- and spatial-aware gating module that maximally exploits opportunistic sparsity. Extensive experiments were conducted on the CIFAR-10 and ImageNet datasets based on ResNet. The results revealed that the proposed architecture reduces the number of multiply-accumulate (MAC) operations by 1.97–11.78× on CIFAR-10 and 1.37–13.12× on ImageNet, with negligible accuracy degradation in the inference stage compared with the baseline architectures.
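To make the gating idea concrete, below is a minimal sketch of a joint channel and spatial gate in the same spirit: a squeeze-style branch predicts per-channel on/off decisions and a 1x1 convolution predicts a per-pixel execution mask, with both binarized through a straight-through estimator so the gates remain trainable. The module structure, the 0.5 threshold, and the reduction ratio are illustrative assumptions, not the paper's exact CSGN design, and actual MAC savings would require a kernel that skips masked positions at inference time.

```python
import torch
import torch.nn as nn

class ChannelSpatialGate(nn.Module):
    """Sketch of a dynamic channel + spatial gating module (hypothetical design)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel branch: global average pool -> bottleneck MLP -> per-channel logits.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial branch: 1x1 conv produces one execution logit per pixel.
        self.spatial_conv = nn.Conv2d(channels, 1, kernel_size=1)

    @staticmethod
    def _hard_gate(logits):
        # Hard 0/1 decision in the forward pass; sigmoid gradient in the
        # backward pass (straight-through estimator).
        soft = torch.sigmoid(logits)
        hard = (soft > 0.5).to(soft.dtype)
        return hard + soft - soft.detach()

    def forward(self, x):
        n, c, _, _ = x.shape
        ch_logits = self.channel_fc(x.mean(dim=(2, 3)))        # (N, C)
        ch_mask = self._hard_gate(ch_logits).view(n, c, 1, 1)  # channel on/off
        sp_mask = self._hard_gate(self.spatial_conv(x))        # (N, 1, H, W)
        return x * ch_mask * sp_mask                           # gated features
```

In a ResNet, such a gate would typically sit inside each residual block so that channels or pixels whose masks are zero can skip the block's convolutions entirely, which is where the reported MAC reductions would come from.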