Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.633
BERT-of-Theseus: Compressing BERT by Progressive Module Replacing

Abstract: In this paper, we propose a novel model compression approach to effectively compress BERT by progressive module replacing. Our approach first divides the original BERT into several modules and builds their compact substitutes. Then, we randomly replace the original modules with their substitutes to train the compact modules to mimic the behavior of the original modules. We progressively increase the probability of replacement through the training. In this way, our approach brings a deeper level of interaction …
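The abstract describes the replacement mechanism only at a high level, so the following is a minimal PyTorch sketch of how progressive module replacing can be wired up, assuming each frozen "predecessor" (original) module is paired with one compact "successor" module. The class name TheseusEncoder, the method set_replace_rate, and the linear replacement schedule are illustrative assumptions rather than the authors' released implementation; per the paper, no extra distillation loss is added and only the successors are trained.

```python
# Minimal sketch of progressive module replacing (illustrative names, not the authors' code).
import torch
import torch.nn as nn

class TheseusEncoder(nn.Module):
    def __init__(self, predecessor_blocks, successor_blocks):
        super().__init__()
        # Original (teacher) modules are kept frozen; compact substitutes are trained.
        self.predecessors = nn.ModuleList(predecessor_blocks)
        self.successors = nn.ModuleList(successor_blocks)
        for p in self.predecessors.parameters():
            p.requires_grad = False
        self.replace_rate = 0.0  # probability of routing through a successor

    def set_replace_rate(self, step, total_steps, base_rate=0.3):
        # Linearly increase the replacement probability over training, capped at 1.0.
        self.replace_rate = min(1.0, base_rate + (1.0 - base_rate) * step / total_steps)

    def forward(self, hidden_states):
        for pred, succ in zip(self.predecessors, self.successors):
            # At inference time only the compact successors are used.
            use_successor = (not self.training) or (torch.rand(1).item() < self.replace_rate)
            if use_successor:
                hidden_states = succ(hidden_states)
            else:
                with torch.no_grad():
                    hidden_states = pred(hidden_states)
        return hidden_states
```

Because original and compact modules are mixed within a single forward pass, the gradients that reach each successor already account for the behavior of the surrounding original modules, which is the "deeper level of interaction" the abstract refers to.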

Cited by 128 publications (71 citation statements)
References 20 publications (37 reference statements)
“…In this paper, we compare our BERT-EMD with several state-of-the-art BERT compression approaches, including the original 4/6-layer BERT models (Devlin et al., 2018), DistilBERT (Tang et al., 2019), BERT-PKD, TinyBERT (Jiao et al., 2019), and BERT-of-Theseus (Xu et al., 2020). However, the original TinyBERT employs a data augmentation strategy in the training process, which is different from the other baseline models.…”
Section: Baseline Methods (mentioning)
confidence: 99%
“…Jiao et al. (2019) proposed the TinyBERT model, which performs Transformer distillation at both the pre-training and fine-tuning stages. Xu et al. (2020) proposed the BERT-of-Theseus model, which learns a compact student network by replacing the teacher layers with their substitutes. Sun et al. (2020) introduced the MobileBERT model, which has the same number of layers as the teacher network but is much narrower, adopting bottleneck structures.…”
Section: Related Work (mentioning)
confidence: 99%
“…At present, the MobileBERT network defines the state of the art in low-latency text classification for mobile devices (Sun et al., 2020). MobileBERT takes approximately 0.6 seconds to classify a text sequence on a Google Pixel 3 smartphone while achieving higher accuracy on the GLUE benchmark, which consists of 9 natural language understanding (NLU) datasets (Wang et al., 2018), than other efficient networks such as DistilBERT, PKD, and several others (Lan et al., 2019; Turc et al., 2019; Jiao et al., 2019; Xu et al., 2020). To achieve this, MobileBERT introduced two concepts into their NLP self-attention network that are already in widespread use in CV neural networks:…”
Section: What Has CV Research Already Taught (mentioning)
confidence: 99%
“…While the term "knowledge distillation" was coined by Hinton et al. to describe a specific method and equation (Hinton et al., 2015), the term "distillation" is now used in reference to a diverse range of approaches where a "student" network is trained to replicate a "teacher" network. Some researchers distill only the final layer of the network, while others also distill the hidden layers (Sun et al., 2020; Xu et al., 2020). When distilling the hidden layers, some apply layer-by-layer distillation warmup, where each module of the student network is distilled independently while downstream modules are frozen (Sun et al., 2020).…”
Section: Training With Bells and Whistles (mentioning)
confidence: 99%
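To make the distinction drawn in this excerpt concrete (final-layer distillation versus hidden-layer distillation), the sketch below shows the two loss terms in their commonly used form: a temperature-softened cross-entropy on the final logits in the spirit of Hinton et al. (2015), plus an optional MSE term on intermediate hidden states. The function names and the temperature value are illustrative assumptions and do not reproduce the exact formulation of any single cited paper.

```python
# Sketch of the two distillation flavours mentioned above (illustrative, not from the cited papers):
# (1) final-layer distillation on softened logits; (2) hidden-layer distillation via MSE.
import torch
import torch.nn.functional as F

def final_layer_kd_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between temperature-softened teacher and student distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

def hidden_layer_kd_loss(student_hidden, teacher_hidden):
    # MSE between selected student layers and the teacher layers they are mapped to.
    return sum(F.mse_loss(s, t) for s, t in zip(student_hidden, teacher_hidden)) / len(student_hidden)
```

In this framing, BERT-of-Theseus is notable precisely because it uses neither term: the compact modules are trained through the task loss alone, via the module replacement described in the abstract above.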