Hanrui Wang scite author profile

AMC: AutoML for Model Compression and Acceleration on Mobile Devices

Model compression is a critical technique for efficiently deploying neural network models on mobile devices, which have limited computation resources and tight power budgets. Conventional model compression techniques rely on hand-crafted heuristics and rule-based policies that require domain experts to explore a large design space, trading off among model size, speed, and accuracy; this process is usually sub-optimal and time-consuming. In this paper, we propose AutoML for Model Compression (AMC), which leverages reinforcement learning to provide the model compression policy. This learning-based compression policy outperforms conventional rule-based compression policies by achieving a higher compression ratio, better preserving accuracy, and freeing human labor. Under 4× FLOPs reduction, we achieved 2.7% better accuracy than the hand-crafted model compression policy for VGG-16 on ImageNet. We applied this automated, push-the-button compression pipeline to MobileNet and achieved a 1.81× speedup of measured inference latency on an Android phone and a 1.43× speedup on the Titan XP GPU, with only 0.1% loss of ImageNet Top-1 accuracy.

[Figure: AMC engine overview. Agent: DDPG (actor and critic). Environment: channel pruning, applied layer by layer (layer t-1, layer t, layer t+1). State embedding: s_t = [N, C, H, W, i, …]. Action: compress with a sparsity ratio (e.g. 50%). Reward = -Error * log(FLOPs). Model compression by human: labor-consuming, sub-optimal. Model compression by AI: automated, higher compression rate, faster.]
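
The reward shown in the AMC figure text, Reward = -Error * log(FLOPs), can be illustrated with a minimal sketch. The function names, the use of the natural logarithm, and the way FLOPs are counted for a pruned convolution are assumptions made for illustration; they are not taken from the AMC code.

```python
import math


def amc_reward(error: float, flops: float) -> float:
    """Reward = -Error * log(FLOPs), following the figure text.

    Assumption: natural logarithm; the abstract does not state the base.
    """
    if flops <= 1:
        raise ValueError("flops must be greater than 1 so that log(flops) > 0")
    return -error * math.log(flops)


def pruned_conv_flops(c_in: int, c_out: int, kernel: int,
                      h_out: int, w_out: int, sparsity: float) -> int:
    """Multiply-accumulate count of a conv layer after input-channel pruning.

    Hypothetical helper: `sparsity` is the fraction of input channels
    removed, which mirrors the "sparsity ratio" action in the figure.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    kept = max(1, round(c_in * (1.0 - sparsity)))
    return kept * c_out * kernel * kernel * h_out * w_out


if __name__ == "__main__":
    # Made-up error rates, used only to show how the reward trades off
    # accuracy against compute for two candidate sparsity ratios.
    for sparsity, error in [(0.3, 0.28), (0.5, 0.30)]:
        flops = pruned_conv_flops(64, 128, 3, 56, 56, sparsity)
        print(sparsity, flops, amc_reward(error, flops))
```

At equal error, the candidate with fewer FLOPs receives the higher (less negative) reward, which is the pressure toward compression that the agent learns from.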

Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution

Tang et al. 2020

HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

et al. 2020

Transformers are ubiquitous in Natural Language Processing (NLP) tasks, but they are difficult to deploy on hardware because of their intensive computation. To enable low-latency inference on resource-constrained hardware platforms, we propose to design Hardware-Aware Transformers (HAT) with neural architecture search. We first construct a large design space with arbitrary encoder-decoder attention and heterogeneous layers. Then we train a SuperTransformer that covers all candidates in the design space and efficiently produces many SubTransformers with weight sharing. Finally, we perform an evolutionary search with a hardware latency constraint to find a specialized SubTransformer dedicated to running fast on the target hardware. Extensive experiments on four machine translation tasks demonstrate that HAT can discover efficient models for different hardware (CPU, GPU, IoT device). When running the WMT'14 translation task on a Raspberry Pi-4, HAT achieves a 3× speedup and 3.7× smaller size over the baseline Transformer, and a 2.7× speedup and 3.6× smaller size over the Evolved Transformer, with 12,041× less search cost and no performance loss. HAT is open-sourced.
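
The last step of the HAT abstract, an evolutionary search under a hardware latency constraint, can be sketched roughly as follows. Everything concrete here is an assumption for illustration: the tiny design space, the `latency_ms` and `score` stand-ins (a real system would use measured or predicted latency and the validation quality of a weight-sharing SubTransformer), and the population settings. It is not the HAT implementation.

```python
import random

# Hypothetical toy design space; the space described in the abstract is far larger.
SPACE = {
    "layers": [1, 2, 3, 4, 5, 6],
    "embed_dim": [512, 640],
    "ffn_dim": [1024, 2048, 3072],
}


def sample(rng):
    """Draw one candidate uniformly from the toy design space."""
    return {name: rng.choice(choices) for name, choices in SPACE.items()}


def latency_ms(cfg):
    """Made-up stand-in for latency on the target hardware."""
    return cfg["layers"] * (cfg["embed_dim"] + cfg["ffn_dim"]) / 100.0


def score(cfg):
    """Made-up stand-in for model quality (higher is better)."""
    return 2.0 * cfg["layers"] + cfg["embed_dim"] / 256.0 + cfg["ffn_dim"] / 1024.0


def mutate(cfg, rng, prob=0.3):
    """Resample each choice independently with probability `prob`."""
    child = dict(cfg)
    for name, choices in SPACE.items():
        if rng.random() < prob:
            child[name] = rng.choice(choices)
    return child


def crossover(a, b, rng):
    """Take each choice from one of the two parents at random."""
    return {name: rng.choice([a[name], b[name]]) for name in SPACE}


def evolutionary_search(max_latency_ms, iterations=20, population=32,
                        parents=8, seed=0):
    rng = random.Random(seed)

    def draw(make):
        # Rejection sampling: keep only candidates within the latency budget.
        for _ in range(10_000):
            cfg = make()
            if latency_ms(cfg) <= max_latency_ms:
                return cfg
        raise ValueError("no feasible candidate found under the latency budget")

    pop = [draw(lambda: sample(rng)) for _ in range(population)]
    for _ in range(iterations):
        top = sorted(pop, key=score, reverse=True)[:parents]
        mutants = [draw(lambda: mutate(rng.choice(top), rng))
                   for _ in range(population // 2)]
        crosses = [draw(lambda: crossover(rng.choice(top), rng.choice(top), rng))
                   for _ in range(population // 2)]
        pop = top + mutants + crosses
    return max(pop, key=score)


if __name__ == "__main__":
    best = evolutionary_search(max_latency_ms=100.0)
    print(best, latency_ms(best), score(best))
```

Enforcing the constraint by rejection keeps every member of the population deployable on the target device, so the returned candidate always meets the budget; a penalty term in the score would be an alternative design.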

APQ: Joint Search for Network Architecture, Pruning and Quantization Policy

et al. 2020

We present APQ for efficient deep learning inference on resource-constrained hardware. Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner. To deal with the larger design space this brings, a promising approach is to train a quantization-aware accuracy predictor that quickly estimates the accuracy of the quantized model and feeds it to the search engine to select the best fit. However, training this quantization-aware accuracy predictor requires collecting a large number of (quantized model, accuracy) pairs, which involves quantization-aware fine-tuning and is thus highly time-consuming. To tackle this challenge, we propose to transfer the knowledge from a full-precision (i.e., fp32) accuracy predictor to the quantization-aware (i.e., int8) accuracy predictor, which greatly improves the sample efficiency. Moreover, collecting the dataset for the fp32 accuracy predictor only requires evaluating neural networks sampled from a pretrained once-for-all [3] network, without any training cost, which is highly efficient. Extensive experiments on ImageNet demonstrate the benefits of our joint optimization approach. With the same accuracy, APQ reduces the latency/energy by 2×/1.3× over MobileNetV2+HAQ [30,36]. Compared to the separate optimization approach (ProxylessNAS+AMC+HAQ [5,12,36]), APQ achieves 2.3% higher ImageNet accuracy while reducing GPU hours and CO2 emissions by orders of magnitude, pushing the frontier for green AI that is environmentally friendly. The code and video are publicly available.

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

2021

Hanrui Wang

AMC: AutoML for Model Compression and Acceleration on Mobile Devices

Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution

HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

APQ: Joint Search for Network Architecture, Pruning and Quantization Policy

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
