Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.372

TinyBERT: Distilling BERT for Natural Language Understanding

Abstract: Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By lever…
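For orientation, the Transformer distillation idea in the abstract combines layer-wise objectives between a large teacher and a small student. The following is a minimal PyTorch sketch of objectives of that kind (attention-map matching, hidden-state matching with a learned projection, and soft-label matching); the dictionary keys, the `proj` module, and the one-to-one layer mapping are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def transformer_distillation_loss(student_out, teacher_out, proj, temperature=1.0):
    """Layer-wise distillation losses of the kind outlined in the abstract.

    student_out / teacher_out are assumed to be dicts holding, per mapped layer,
    the attention matrices, the hidden states, and the final logits. `proj` is a
    learnable linear map that lifts the (smaller) student hidden size to the
    teacher's hidden size. Names and structure are illustrative assumptions.
    """
    # 1) Attention-based distillation: match attention matrices with MSE.
    attn_loss = sum(
        F.mse_loss(a_s, a_t)
        for a_s, a_t in zip(student_out["attentions"], teacher_out["attentions"])
    )

    # 2) Hidden-state distillation: project student states, then match with MSE.
    hidden_loss = sum(
        F.mse_loss(proj(h_s), h_t)
        for h_s, h_t in zip(student_out["hidden_states"], teacher_out["hidden_states"])
    )

    # 3) Prediction-layer distillation: soft cross-entropy on temperature-scaled logits.
    t = temperature
    soft_targets = F.softmax(teacher_out["logits"] / t, dim=-1)
    log_probs = F.log_softmax(student_out["logits"] / t, dim=-1)
    pred_loss = -(soft_targets * log_probs).sum(dim=-1).mean()

    return attn_loss + hidden_loss + pred_loss
```

In practice the student usually has fewer layers than the teacher, so each student layer is first mapped to a selected teacher layer before these losses are computed.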

Cited by 973 publications (1,013 citation statements)
References 19 publications
“…Notably, all these computations are applied once, for a given catalog, and can be executed in an offline manner and cached for later use. To further accelerate the computation time of the two CTDM scores applied through RecoBERT inference, one can adopt knowledge distillation techniques, such as (Barkan et al., 2019; Jiao et al., 2019; Lioutas et al., 2019), which are beyond the scope of this work.…”
Section: Computational Costs (mentioning)
confidence: 99%
“…The knowledge distillation approach enables the transfer of knowledge from a large teacher model to a smaller student model. Such attempts have been made to distill BERT models, e.g., DistilBERT (Sanh et al., 2019), BERT-PKD (Sun et al., 2019), Distilled BiLSTM (Tang et al., 2019), TinyBERT (Jiao et al., 2019), MobileBERT (Sun et al., 2020), etc. All of these methods require carefully designing the student architecture.…”
Section: Pre-trained Language Model Compression (mentioning)
confidence: 99%
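Since several of these excerpts lean on the teacher–student idea, here is a minimal, generic sketch of a response-based distillation step in the spirit of Hinton et al. (2015). It assumes `student` and `teacher` are callables mapping a batch of inputs to logits; it illustrates the general technique, not the training loop of any of the cited systems.

```python
import torch
import torch.nn.functional as F

def kd_training_step(student, teacher, batch, optimizer, temperature=2.0, alpha=0.5):
    """One knowledge-distillation step: the small student mimics the large
    teacher's softened output distribution while also fitting the true labels.
    `student`/`teacher` map input tensors to logits (an assumed interface)."""
    inputs, labels = batch

    with torch.no_grad():                      # the teacher is frozen
        teacher_logits = teacher(inputs)

    student_logits = student(inputs)

    # Soft-target loss: KL divergence between temperature-scaled distributions.
    t = temperature
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # Hard-target loss on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    loss = alpha * soft_loss + (1.0 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```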
“…However, these models often consume considerable storage, memory bandwidth, and computational resources. To reduce the model size and increase the inference throughput, compression techniques such as knowledge distillation (Sanh et al., 2019; Sun et al., 2019; Tang et al., 2019; Jiao et al., 2019; Sun et al., 2020) … [figure caption fragment: comparison of knowledge distillation methods (DistilBERT (Sanh et al., 2019) and BERT-PKD (Sun et al., 2019)) and iterative pruning methods (Iterative Pruning (Guo et al., 2019) and our proposed method) in terms of accuracy at various compression rates on the MNLI test set]. Knowledge distillation methods require re-distillation from the teacher to get each single data point, whereas iterative pruning methods can produce continuous curves at once.…”
Section: Introduction (mentioning)
confidence: 99%
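To make that contrast concrete, the sketch below shows why iterative pruning yields a whole accuracy-versus-compression curve from a single run: each round zeroes the smallest-magnitude weights at a higher sparsity level and records one evaluation point, whereas distillation would need a separate re-distillation per target size. The `evaluate` and `finetune` callables are assumed placeholders; this is a generic magnitude-pruning sketch, not the cited papers' exact procedures.

```python
import torch

def iterative_magnitude_pruning(model, evaluate, sparsity_steps=(0.2, 0.4, 0.6, 0.8), finetune=None):
    """Prune progressively larger fractions of low-magnitude weights and record
    an evaluation point per level, producing an accuracy-vs-sparsity curve."""
    curve = []
    for sparsity in sparsity_steps:
        with torch.no_grad():
            for param in model.parameters():
                if param.dim() < 2:            # skip biases and LayerNorm weights
                    continue
                k = int(sparsity * param.numel())
                if k == 0:
                    continue
                # Magnitude threshold below which weights are zeroed this round.
                threshold = param.abs().flatten().kthvalue(k).values
                param.mul_(param.abs() > threshold)
        if finetune is not None:
            finetune(model)                    # optional recovery fine-tuning
        curve.append((sparsity, evaluate(model)))
    return curve                               # one (sparsity, accuracy) point per level
```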
“…Existing BERT-oriented model compression solutions largely depend on knowledge distillation (Hinton et al., 2015), which is inefficient and resource-consuming because a large training corpus is required to learn the behaviors of a teacher. For example, DistilBERT (Sanh et al., 2019) is re-trained on the same corpus as pre-training a vanilla BERT from scratch; and TinyBERT (Jiao et al., 2019) utilizes expensive data augmentation to fit the distillation target. The costs of these model compression methods are as large as pre-training, which are unaffordable for low-resource settings.…”
Section: Introduction (mentioning)
confidence: 99%