Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.372

TinyBERT: Distilling BERT for Natural Language Understanding

Abstract: Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By lever…
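For orientation, the Transformer distillation idea in the abstract combines layer-wise objectives between a large teacher and a small student. The following is a minimal PyTorch sketch of objectives of that kind (attention-map matching, hidden-state matching with a learned projection, and soft-label matching); the dictionary keys, the `proj` module, and the one-to-one layer mapping are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def transformer_distillation_loss(student_out, teacher_out, proj, temperature=1.0):
    """Layer-wise distillation losses of the kind outlined in the abstract.

    student_out / teacher_out are assumed to be dicts holding, per mapped layer,
    the attention matrices, the hidden states, and the final logits. `proj` is a
    learnable linear map that lifts the (smaller) student hidden size to the
    teacher's hidden size. Names and structure are illustrative assumptions.
    """
    # 1) Attention-based distillation: match attention matrices with MSE.
    attn_loss = sum(
        F.mse_loss(a_s, a_t)
        for a_s, a_t in zip(student_out["attentions"], teacher_out["attentions"])
    )

    # 2) Hidden-state distillation: project student states, then match with MSE.
    hidden_loss = sum(
        F.mse_loss(proj(h_s), h_t)
        for h_s, h_t in zip(student_out["hidden_states"], teacher_out["hidden_states"])
    )

    # 3) Prediction-layer distillation: soft cross-entropy on temperature-scaled logits.
    t = temperature
    soft_targets = F.softmax(teacher_out["logits"] / t, dim=-1)
    log_probs = F.log_softmax(student_out["logits"] / t, dim=-1)
    pred_loss = -(soft_targets * log_probs).sum(dim=-1).mean()

    return attn_loss + hidden_loss + pred_loss
```

In practice the student usually has fewer layers than the teacher, so each student layer is first mapped to a selected teacher layer before these losses are computed.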

Cited by 973 publications (1,013 citation statements)
References 19 publications
“…Notably, all these computations are applied once, for a given catalog, and can be executed in an offline manner and cached for later use. To further accelerate the computation time of the two CTDM scores applied through RecoBERT inference, one can adopt knowledge distillation techniques, such as (Barkan et al., 2019; Jiao et al., 2019; Lioutas et al., 2019), which are beyond the scope of this work.…”
Section: Computational Costs (mentioning)
confidence: 99%
“…The knowledge distillation approach enables the transfer of knowledge from a large teacher model to a smaller student model. Such attempts have been made to distill BERT models, e.g., DistilBERT (Sanh et al., 2019), BERT-PKD (Sun et al., 2019), Distilled BiLSTM (Tang et al., 2019), TinyBERT (Jiao et al., 2019), MobileBERT (Sun et al., 2020), etc. All of these methods require carefully designing the student architecture.…”
Section: Pre-trained Language Model Compression (mentioning)
confidence: 99%
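Since several of these excerpts lean on the teacher–student idea, here is a minimal, generic sketch of a response-based distillation step in the spirit of Hinton et al. (2015). It assumes `student` and `teacher` are callables mapping a batch of inputs to logits; it illustrates the general technique, not the training loop of any of the cited systems.

```python
import torch
import torch.nn.functional as F

def kd_training_step(student, teacher, batch, optimizer, temperature=2.0, alpha=0.5):
    """One knowledge-distillation step: the small student mimics the large
    teacher's softened output distribution while also fitting the true labels.
    `student`/`teacher` map input tensors to logits (an assumed interface)."""
    inputs, labels = batch

    with torch.no_grad():                      # the teacher is frozen
        teacher_logits = teacher(inputs)

    student_logits = student(inputs)

    # Soft-target loss: KL divergence between temperature-scaled distributions.
    t = temperature
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # Hard-target loss on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    loss = alpha * soft_loss + (1.0 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```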
“…However, these models often consume considerable storage, memory bandwidth, and computational resources. To reduce the model size and increase the inference throughput, compression techniques such as knowledge distillation (Sanh et al., 2019; Sun et al., 2019; Tang et al., 2019; Jiao et al., 2019; Sun et al., 2020) … [figure caption fragment: comparison of knowledge distillation methods (DistilBERT (Sanh et al., 2019) and BERT-PKD (Sun et al., 2019)) and iterative pruning methods (Iterative Pruning (Guo et al., 2019) and our proposed method) in terms of accuracy at various compression rates on the MNLI test set]. Knowledge distillation methods require re-distillation from the teacher to get each single data point, whereas iterative pruning methods can produce continuous curves at once.…”
Section: Introduction (mentioning)
confidence: 99%
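To make that contrast concrete, the sketch below shows why iterative pruning yields a whole accuracy-versus-compression curve from a single run: each round zeroes the smallest-magnitude weights at a higher sparsity level and records one evaluation point, whereas distillation would need a separate re-distillation per target size. The `evaluate` and `finetune` callables are assumed placeholders; this is a generic magnitude-pruning sketch, not the cited papers' exact procedures.

```python
import torch

def iterative_magnitude_pruning(model, evaluate, sparsity_steps=(0.2, 0.4, 0.6, 0.8), finetune=None):
    """Prune progressively larger fractions of low-magnitude weights and record
    an evaluation point per level, producing an accuracy-vs-sparsity curve."""
    curve = []
    for sparsity in sparsity_steps:
        with torch.no_grad():
            for param in model.parameters():
                if param.dim() < 2:            # skip biases and LayerNorm weights
                    continue
                k = int(sparsity * param.numel())
                if k == 0:
                    continue
                # Magnitude threshold below which weights are zeroed this round.
                threshold = param.abs().flatten().kthvalue(k).values
                param.mul_(param.abs() > threshold)
        if finetune is not None:
            finetune(model)                    # optional recovery fine-tuning
        curve.append((sparsity, evaluate(model)))
    return curve                               # one (sparsity, accuracy) point per level
```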
“…Existing BERT-oriented model compression solutions largely depend on knowledge distillation (Hinton et al., 2015), which is inefficient and resource-consuming because a large training corpus is required to learn the behaviors of a teacher. For example, DistilBERT (Sanh et al., 2019) is re-trained on the same corpus as pre-training a vanilla BERT from scratch; and TinyBERT (Jiao et al., 2019) utilizes expensive data augmentation to fit the distillation target. The costs of these model compression methods are as large as pre-training, which are unaffordable for low-resource settings.…”
Section: Introduction (mentioning)
confidence: 99%