Structural pruning of neural network parameters reduces computation, energy, and memory transfer costs during inference. We propose a novel method that estimates the contribution of a neuron (filter) to the final loss and iteratively removes those with smaller scores. We describe two variations of our method that use first- and second-order Taylor expansions to approximate a filter's contribution. Both methods scale consistently across any network layer without requiring per-layer sensitivity analysis and can be applied to any kind of layer, including skip connections. For modern networks trained on ImageNet, we experimentally measured a high (>93%) correlation between the contribution computed by our methods and a reliable estimate of the true importance. Pruning with the proposed methods leads to an improvement over the state-of-the-art in terms of accuracy, FLOPs, and parameter reduction. On ResNet-101, we achieve a 40% FLOPs reduction by removing 30% of the parameters, with a loss of 0.02% in top-1 accuracy on ImageNet. Code is available at https://github.com/NVlabs/Taylor_pruning.
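As a rough illustration of the first-order variant described above, a filter's importance can be scored by accumulating the squared product of its weights and their gradients over a few mini-batches, then pruning the lowest-scoring filters. This is a minimal PyTorch-style sketch under our own assumptions (the function name, the squaring-and-accumulation details, and the Conv2d-only loop are illustrative); the released code at the URL above is the authoritative implementation.

```python
import torch
import torch.nn as nn

def taylor_importance(model, data_loader, criterion, device="cpu"):
    """Accumulate a first-order Taylor importance score per conv filter:
    (weight * gradient) summed over each filter's parameters, squared,
    as a proxy for the change in loss if that filter were removed."""
    model.to(device).train()
    scores = {name: torch.zeros(m.out_channels)
              for name, m in model.named_modules() if isinstance(m, nn.Conv2d)}
    for inputs, targets in data_loader:
        model.zero_grad()
        loss = criterion(model(inputs.to(device)), targets.to(device))
        loss.backward()
        for name, m in model.named_modules():
            if isinstance(m, nn.Conv2d) and m.weight.grad is not None:
                # Sum over input channels and kernel dims -> one value per filter.
                contrib = (m.weight.detach() * m.weight.grad).sum(dim=(1, 2, 3))
                scores[name] += contrib.pow(2).cpu()
    return scores
```

Filters with the smallest accumulated scores would then be removed, and the score/prune/fine-tune cycle repeated iteratively.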
We present two techniques to improve landmark localization in images from partially annotated datasets. Our primary goal is to leverage the common situation where precise landmark locations are provided only for a small subset of the data, but where class labels for classification or regression tasks related to the landmarks are more abundantly available. First, we propose the framework of sequential multitasking and explore it here through an architecture for landmark localization in which training with class labels acts as an auxiliary signal to guide landmark localization on unlabeled data. A key aspect of our approach is that errors can be backpropagated through a complete landmark localization model. Second, we propose and explore an unsupervised learning technique for landmark localization based on having a model predict equivariant landmarks with respect to transformations applied to the image. We show that these techniques improve landmark prediction considerably and can learn effective detectors even when only a small fraction of the dataset has landmark labels. We present results on two toy datasets and four real datasets with hands and faces, and report a new state-of-the-art on two datasets in the wild; for example, with only 5% of labeled images we outperform the previous state-of-the-art trained on the AFLW dataset.

Architecture details for the Shapes and Blocks datasets:

Shapes Dataset (Model HP: λ = 0, α = 0, γ = 0, β = 1, ADAM)
- Landmark Localization Network: Input = 60 × 60 × 1; six layers of Conv 7 × 7 × 16, ReLU, stride 1, SAME; Conv 1 × 1 × 16, ReLU, stride 1, SAME; Conv 1 × 1 × 2, ReLU, stride 1, SAME; soft-argmax(num channels=2)
- Classification Network: FC #units = 40, ReLU; FC #units = 2, Linear; softmax(dim=2)

Blocks Dataset (Model HP: λ = 1, α = 1, β = 1, ADAM)
- Landmark Localization Network: Input = 60 × 60 × 1; six layers of Conv 9 × 9 × 8, ReLU, stride 1, SAME; Conv 1 × 1 × 8, ReLU, stride 1, SAME; Conv 1 × 1 × 5, ReLU, stride 1, SAME; soft-argmax(num channels=5)
- Classification Network: FC #units = 256, ReLU, dropout-prob=.25; FC #units = 256, ReLU, dropout-prob=.25; FC #units = 15, Linear; softmax(dim=15)

Table S12: Architecture details of Seq-MT model used for Hands and Multi-PIE datasets.
Preprocessing: scale and translation [-10%, 10%] of the face bounding box and rotation [-20, 20], applied randomly at every epoch.

Hands Dataset (Model HP: λ = 0.5, α = 0.3, γ = 10⁻⁵, β = 0.001, ADAM)
- Landmark Localization Network: Input = 64 × 64 × 1; …

Multi-PIE Dataset (Model HP: λ = 2, α = 0.3, γ = 10⁻⁵, β = 0.001, ADAM)
- Landmark Localization Network: Input = 64 × 64 × 1; …
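As a concrete reading of the equivariance technique and the differentiable landmark readout described in the abstract above, the following is a minimal PyTorch-style sketch; `warp_image`, `warp_coords`, the normalized-coordinate convention, and the MSE penalty are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_argmax(heatmaps):
    """Differentiable landmark readout: per-channel spatial softmax followed by
    the expected (x, y) coordinate, so localization errors can backpropagate
    through the whole network."""
    b, k, h, w = heatmaps.shape
    probs = F.softmax(heatmaps.view(b, k, -1), dim=-1).view(b, k, h, w)
    ys = torch.linspace(0, 1, h, device=heatmaps.device)
    xs = torch.linspace(0, 1, w, device=heatmaps.device)
    y = (probs.sum(dim=3) * ys).sum(dim=2)  # expected row per landmark channel
    x = (probs.sum(dim=2) * xs).sum(dim=2)  # expected column per landmark channel
    return torch.stack([x, y], dim=-1)      # (batch, num_landmarks, 2)

def equivariance_loss(net, images, warp_image, warp_coords):
    """Unsupervised term on unlabeled images: landmarks predicted on a warped
    image should match the warped landmarks of the original image.
    `warp_image` applies a known transform T to the images and `warp_coords`
    applies the same T to normalized landmark coordinates (both assumed here)."""
    pred = soft_argmax(net(images))
    pred_on_warped = soft_argmax(net(warp_image(images)))
    return F.mse_loss(pred_on_warped, warp_coords(pred))
```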
Convolutional neural networks (CNNs) are increasingly used in many areas of computer vision. They are particularly attractive because of their ability to "absorb" great quantities of labeled data through millions of parameters. However, as model sizes increase, so do the storage and memory requirements of the classifiers, hindering many applications such as image and speech recognition on mobile phones and other devices. In this paper, we present a novel network architecture, Frequency-Sensitive Hashed Nets (FreshNets), which exploits inherent redundancy in both the convolutional layers and the fully-connected layers of a deep learning model, leading to dramatic savings in memory and storage consumption. Based on the key observation that the weights of learned convolutional filters are typically smooth and low-frequency, we first convert filter weights to the frequency domain with a discrete cosine transform (DCT) and use a low-cost hash function to randomly group frequency parameters into hash buckets. All parameters assigned to the same hash bucket share a single value learned with standard backpropagation. To further reduce model size, we allocate fewer hash buckets to high-frequency components, which are generally less important. We evaluate FreshNets on eight datasets and show that it leads to better compressed performance than several relevant baselines.
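As a rough sketch of the frequency-domain weight sharing described above (the function name, the toy hash, and the single-bucket-pool layout are our illustrative assumptions, not the paper's exact scheme), a convolutional filter can be reconstructed by looking up each DCT coefficient in a small shared parameter array and applying an inverse DCT:

```python
import numpy as np
from scipy.fft import idctn

def reconstruct_filter(bucket_values, filter_shape, layer_seed, num_buckets):
    """Rebuild one k x k filter from shared hash buckets (illustrative sketch).
    Each DCT coefficient is looked up from `bucket_values` via a cheap hash of
    its (layer, row, col) index, so many coefficients share one learned value;
    the spatial filter is then recovered with an inverse DCT."""
    k_h, k_w = filter_shape
    freq = np.empty(filter_shape)
    for i in range(k_h):
        for j in range(k_w):
            # Toy deterministic hash; a real implementation would use a proper
            # low-cost hash and allocate fewer buckets to high-frequency
            # positions (large i + j), which matter less.
            h = (layer_seed * 1000003 + i * 73856093 + j * 19349663) % num_buckets
            freq[i, j] = bucket_values[h]
    return idctn(freq, norm="ortho")  # frequency domain -> spatial filter weights

# Usage: 16 shared parameters standing in for a 5 x 5 filter's 25 coefficients.
shared = np.random.randn(16)
w = reconstruct_filter(shared, (5, 5), layer_seed=0, num_buckets=16)
```

During training, gradients with respect to the reconstructed filter would be routed back through the same hash mapping and accumulated into the shared bucket values.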