Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.633
BERT-of-Theseus: Compressing BERT by Progressive Module Replacing

Abstract: In this paper, we propose a novel model compression approach to effectively compress BERT by progressive module replacing. Our approach first divides the original BERT into several modules and builds their compact substitutes. Then, we randomly replace the original modules with their substitutes to train the compact modules to mimic the behavior of the original modules. We progressively increase the probability of replacement through the training. In this way, our approach brings a deeper level of interaction …
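The abstract describes the replacement mechanism only at a high level, so the following is a minimal PyTorch sketch of how progressive module replacing can be wired up, assuming each frozen "predecessor" (original) module is paired with one compact "successor" module. The class name TheseusEncoder, the method set_replace_rate, and the linear replacement schedule are illustrative assumptions rather than the authors' released implementation; per the paper, no extra distillation loss is added and only the successors are trained.

```python
# Minimal sketch of progressive module replacing (illustrative names, not the authors' code).
import torch
import torch.nn as nn

class TheseusEncoder(nn.Module):
    def __init__(self, predecessor_blocks, successor_blocks):
        super().__init__()
        # Original (teacher) modules are kept frozen; compact substitutes are trained.
        self.predecessors = nn.ModuleList(predecessor_blocks)
        self.successors = nn.ModuleList(successor_blocks)
        for p in self.predecessors.parameters():
            p.requires_grad = False
        self.replace_rate = 0.0  # probability of routing through a successor

    def set_replace_rate(self, step, total_steps, base_rate=0.3):
        # Linearly increase the replacement probability over training, capped at 1.0.
        self.replace_rate = min(1.0, base_rate + (1.0 - base_rate) * step / total_steps)

    def forward(self, hidden_states):
        for pred, succ in zip(self.predecessors, self.successors):
            # At inference time only the compact successors are used.
            use_successor = (not self.training) or (torch.rand(1).item() < self.replace_rate)
            if use_successor:
                hidden_states = succ(hidden_states)
            else:
                with torch.no_grad():
                    hidden_states = pred(hidden_states)
        return hidden_states
```

Because original and compact modules are mixed within a single forward pass, the gradients that reach each successor already account for the behavior of the surrounding original modules, which is the "deeper level of interaction" the abstract refers to.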

Cited by 128 publications (71 citation statements)
References 20 publications (37 reference statements)
“…In this paper, we compare our BERT-EMD with several state-of-the-art BERT compression approaches, including the original 4/6-layer BERT models (Devlin et al., 2018), DistilBERT (Tang et al., 2019), BERT-PKD, TinyBERT (Jiao et al., 2019), and BERT-of-Theseus (Xu et al., 2020). However, the original TinyBERT employs a data augmentation strategy in the training process, which is different from the other baseline models.…”
Section: Baseline Methods (mentioning)
confidence: 99%
“…Jiao et al. (2019) proposed the TinyBERT model, which performs Transformer distillation at both the pre-training and fine-tuning stages. Xu et al. (2020) proposed the BERT-of-Theseus model, which learns a compact student network by replacing the teacher layers with their substitutes. Sun et al. (2020) introduced the MobileBERT model, which has the same number of layers as the teacher network but is much narrower, adopting bottleneck structures.…”
Section: Related Work (mentioning)
confidence: 99%
“…At present, the MobileBERT network defines the state of the art in low-latency text classification for mobile devices (Sun et al., 2020). MobileBERT takes approximately 0.6 seconds to classify a text sequence on a Google Pixel 3 smartphone while achieving higher accuracy on the GLUE benchmark, which consists of 9 natural language understanding (NLU) datasets (Wang et al., 2018), than other efficient networks such as DistilBERT, PKD, and several others (Lan et al., 2019; Turc et al., 2019; Jiao et al., 2019; Xu et al., 2020). To achieve this, MobileBERT introduced two concepts into their NLP self-attention network that are already in widespread use in CV neural networks:…”
Section: What Has CV Research Already Taught (mentioning)
confidence: 99%
“…While the term "knowledge distillation" was coined by Hinton et al. to describe a specific method and equation (Hinton et al., 2015), the term "distillation" is now used in reference to a diverse range of approaches where a "student" network is trained to replicate a "teacher" network. Some researchers distill only the final layer of the network, while others also distill the hidden layers (Sun et al., 2020; Xu et al., 2020). When distilling the hidden layers, some apply layer-by-layer distillation warmup, where each module of the student network is distilled independently while downstream modules are frozen (Sun et al., 2020).…”
Section: Training With Bells and Whistles (mentioning)
confidence: 99%
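To make the distinction drawn in this excerpt concrete (final-layer distillation versus hidden-layer distillation), the sketch below shows the two loss terms in their commonly used form: a temperature-softened cross-entropy on the final logits in the spirit of Hinton et al. (2015), plus an optional MSE term on intermediate hidden states. The function names and the temperature value are illustrative assumptions and do not reproduce the exact formulation of any single cited paper.

```python
# Sketch of the two distillation flavours mentioned above (illustrative, not from the cited papers):
# (1) final-layer distillation on softened logits; (2) hidden-layer distillation via MSE.
import torch
import torch.nn.functional as F

def final_layer_kd_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between temperature-softened teacher and student distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

def hidden_layer_kd_loss(student_hidden, teacher_hidden):
    # MSE between selected student layers and the teacher layers they are mapped to.
    return sum(F.mse_loss(s, t) for s, t in zip(student_hidden, teacher_hidden)) / len(student_hidden)
```

In this framing, BERT-of-Theseus is notable precisely because it uses neither term: the compact modules are trained through the task loss alone, via the module replacement described in the abstract above.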