Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.686
HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

Abstract: Transformers are ubiquitous in Natural Language Processing (NLP) tasks, but they are difficult to deploy on hardware due to their intensive computation. To enable low-latency inference on resource-constrained hardware platforms, we propose to design Hardware-Aware Transformers (HAT) with neural architecture search. We first construct a large design space with arbitrary encoder-decoder attention and heterogeneous layers. Then we train a Super-Transformer that covers all candidates in the design space, and ef…
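The abstract describes a weight-shared Super-Transformer over a heterogeneous design space. Below is a minimal, hypothetical sketch of sampling one Sub-Transformer configuration from such a space; the dimension ranges and field names are illustrative assumptions, not the authors' code or exact search space.

```python
import random

# Hypothetical design space, loosely in the spirit of the choices HAT describes:
# embedding dims, per-layer FFN dims and head counts, decoder depth, and how many
# encoder layers each decoder layer attends to ("arbitrary encoder-decoder attention").
DESIGN_SPACE = {
    "encoder_embed_dim": [512, 640],
    "decoder_embed_dim": [512, 640],
    "encoder_layers": [6],
    "decoder_layers": [1, 2, 3, 4, 5, 6],
    "ffn_dim": [1024, 2048, 3072],
    "num_heads": [4, 8],
    "attn_span": [1, 2, 3],
}

def sample_subtransformer(space, rng=random):
    """Sample one Sub-Transformer configuration (a candidate architecture)."""
    cfg = {
        "encoder_embed_dim": rng.choice(space["encoder_embed_dim"]),
        "decoder_embed_dim": rng.choice(space["decoder_embed_dim"]),
        "encoder_layers": rng.choice(space["encoder_layers"]),
        "decoder_layers": rng.choice(space["decoder_layers"]),
    }
    # Heterogeneous layers: each layer draws its own FFN dim and head count.
    cfg["encoder_ffn_dims"] = [rng.choice(space["ffn_dim"]) for _ in range(cfg["encoder_layers"])]
    cfg["decoder_ffn_dims"] = [rng.choice(space["ffn_dim"]) for _ in range(cfg["decoder_layers"])]
    cfg["decoder_heads"] = [rng.choice(space["num_heads"]) for _ in range(cfg["decoder_layers"])]
    cfg["decoder_attn_span"] = [rng.choice(space["attn_span"]) for _ in range(cfg["decoder_layers"])]
    return cfg

if __name__ == "__main__":
    print(sample_subtransformer(DESIGN_SPACE))
```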

Citations: Cited by 164 publications (114 citation statements)
References: 29 publications
“…We use Lat(•) to predict the latency of the candidates to filter out the candidates that do not meet the latency constraint. Lat(•) is built with the method by Wang et al. (2020a), which first samples about 10k architectures from A and collects their inference time on target devices, and then uses a feed-forward network to fit the data. For more details of the evolutionary algorithm, please refer to Appendix C. Note that different methods can be used in the search process, such as random search or more advanced strategies, which we leave as future work.…”
Section: Search Process
confidence: 99%
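The citation statement above describes the latency predictor: sample roughly 10k architectures, measure their inference time on the target device, and fit a feed-forward regressor. A minimal sketch of that idea follows; the feature encoding, layer sizes, and training hyperparameters are assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class LatencyPredictor(nn.Module):
    """Small feed-forward network mapping an architecture feature vector to predicted latency."""
    def __init__(self, feature_dim: int, hidden_dim: int = 400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, arch_features: torch.Tensor) -> torch.Tensor:
        return self.net(arch_features).squeeze(-1)

def fit(predictor, feats, latencies, epochs=200, lr=1e-3):
    """feats: (N, feature_dim) encoded architectures; latencies: (N,) measured times."""
    opt = torch.optim.Adam(predictor.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(predictor(feats), latencies)
        loss.backward()
        opt.step()
    return predictor

# Usage with dummy data (replace with ~10k measured architecture/latency pairs):
feats = torch.rand(1000, 10)
lats = torch.rand(1000) * 200  # e.g. milliseconds on the target device
model = fit(LatencyPredictor(feature_dim=10), feats, lats)
```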
“…Moreover, we introduce HAT (Wang et al., 2020a) as a baseline of one-shot learning. HAT focuses on the search space of non-identical layer structures.…”
[Footnote: the first 16 models, https://github.com/google-research/bert, range from 2L128D to 8L768D.]
Section: Ablation Study of One-shot Learning
confidence: 99%
“…Traditional NAS methods (Zhu et al., 2020) use downstream task performance as the objective to search for task-specific models. Instead, similar to the work by Khetan and Karnin (2020)…”
Section: Evolutionary Search
confidence: 99%
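The citation statements above refer to an evolutionary search that filters candidates by a latency constraint before evolving them. Below is a generic, hypothetical sketch of such a loop; the function names (sample_fn, mutate_fn, fitness_fn, latency_fn) are placeholders for the sampler, mutation operator, accuracy proxy, and latency predictor, and are not taken from the paper's code.

```python
import random

def evolutionary_search(population_size, generations, sample_fn, mutate_fn,
                        fitness_fn, latency_fn, latency_constraint):
    """Keep only candidates whose predicted latency meets the constraint,
    then repeatedly select the fittest and mutate them."""
    # Seed the population with latency-feasible candidates.
    population = []
    while len(population) < population_size:
        cand = sample_fn()
        if latency_fn(cand) <= latency_constraint:
            population.append(cand)

    for _ in range(generations):
        # Select the top half as parents.
        parents = sorted(population, key=fitness_fn, reverse=True)[: population_size // 2]
        # Fill the rest of the population with latency-feasible mutations of parents.
        children = []
        while len(children) < population_size - len(parents):
            child = mutate_fn(random.choice(parents))
            if latency_fn(child) <= latency_constraint:
                children.append(child)
        population = parents + children

    return max(population, key=fitness_fn)
```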