Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations 2020
DOI: 10.18653/v1/2020.acl-demos.2
TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural Language Processing

Abstract: In this paper, we introduce TextBrewer, an open-source knowledge distillation toolkit designed for natural language processing. It works with different neural network models and supports various kinds of supervised learning tasks, such as text classification, reading comprehension, and sequence labeling. TextBrewer provides a simple and uniform workflow that enables quick setting up of distillation experiments with highly flexible configurations. It offers a set of predefined distillation methods and can be extended…
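To make the described workflow concrete, here is a minimal sketch of the configuration-plus-distiller pattern that TextBrewer's documentation describes (TrainingConfig, DistillationConfig, GeneralDistiller). The toy teacher and student models, the data, and the hyperparameter values are illustrative assumptions, not settings from the paper.

```python
# Minimal TextBrewer workflow sketch; the toy models and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from textbrewer import GeneralDistiller, TrainingConfig, DistillationConfig

# Toy stand-ins for a teacher and a smaller student; in practice these would be,
# e.g., a fine-tuned BERT teacher and a shallower Transformer student.
teacher = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
student = nn.Sequential(nn.Linear(32, 2))

def adaptor(batch, model_outputs):
    # Tell TextBrewer which part of the model output holds the logits.
    return {"logits": model_outputs}

dataloader = DataLoader(TensorDataset(torch.randn(128, 32)), batch_size=16)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

train_config = TrainingConfig(device="cpu")
distill_config = DistillationConfig(
    temperature=4,        # soften the output distributions
    kd_loss_type="ce",    # soft cross-entropy between teacher and student
    hard_label_weight=0,  # no supervised hard-label loss in this toy example
)

distiller = GeneralDistiller(
    train_config=train_config, distill_config=distill_config,
    model_T=teacher, model_S=student,
    adaptor_T=adaptor, adaptor_S=adaptor,
)

with distiller:
    distiller.train(optimizer, dataloader, num_epochs=1, callback=None)
```

The same three-object pattern (training config, distillation config, distiller) is what the "uniform workflow" in the abstract refers to; switching tasks or models mainly changes the adaptor and the configs.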

Cited by 16 publications (7 citation statements)
References 24 publications
“…However, during experiments, we still observe that the weight of the KD loss α has a large impact on the model performance, and better performance is achieved by setting α to be relatively small (e.g., 0.1 or 0.2). Our observations are consistent with the experimental observations of Yang et al. (2020) and Sun et al. (2019b). Intuitively, it seems that the KD objective is conflicting with the CE loss to a certain degree.…”
Section: Gradient Alignment (supporting)
confidence: 92%
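The quoted observation concerns the usual weighted combination of the distillation and supervised objectives, roughly L = α·L_KD + (1 - α)·L_CE. The sketch below is a generic illustration of that weighting; the temperature and the value of α are assumptions, not code from either cited paper.

```python
# Generic soft-label KD loss combined with cross-entropy, weighted by alpha.
# The values of alpha and T are illustrative, not taken from the cited papers.
import torch
import torch.nn.functional as F

def kd_ce_loss(student_logits, teacher_logits, labels, alpha=0.1, T=4.0):
    # Soft-label distillation term: KL divergence between temperature-scaled
    # distributions, scaled by T^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Standard supervised cross-entropy on the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    # A small alpha (e.g., 0.1 or 0.2) down-weights the KD term relative to CE,
    # matching the observation quoted above.
    return alpha * kd + (1 - alpha) * ce

# Toy usage
s = torch.randn(8, 3, requires_grad=True)
t = torch.randn(8, 3)
y = torch.randint(0, 3, (8,))
kd_ce_loss(s, t, y, alpha=0.1).backward()
```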
“…For future work, we will update TextPruner to support more pre-trained models, such as the generation model T5 (Raffel et al., 2020). We also plan to combine TextPruner with our previously released knowledge distillation toolkit TextBrewer (Yang et al., 2020) into a single framework to provide more effective model compression methods and a uniform interface for knowledge distillation and model pruning.…”
Section: Discussion (mentioning)
confidence: 99%
“…We take English, French and Chinese as the supervised languages. For the monolingual teachers, we use RoBERTa for English (Liu et al., 2019), CamemBERT for French (Martin et al., 2020) and RoBERTa-wwm-ext for Chinese (Cui et al., 2020). All of the teachers share the same structure, with 24 layers and 16 attention heads, which is largely the same as BERT-large, except that they do not use token type embeddings.…”
Section: Methods (mentioning)
confidence: 99%
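As an illustration of this teacher setup, the sketch below loads the configurations of plausible Hugging Face Hub checkpoints for the three monolingual teachers and prints their depth and head counts. The Hub identifiers are assumptions chosen to match the models named in the quote; the cited paper does not confirm these exact checkpoints here.

```python
# Inspect candidate monolingual teacher encoders with Transformers and check
# that they match the quoted structure (24 layers, 16 attention heads).
# The Hub identifiers below are assumptions, not confirmed by the cited paper.
from transformers import AutoConfig

TEACHERS = {
    "en": "roberta-large",                      # RoBERTa (Liu et al., 2019)
    "fr": "camembert/camembert-large",          # CamemBERT (Martin et al., 2020)
    "zh": "hfl/chinese-roberta-wwm-ext-large",  # RoBERTa-wwm-ext (Cui et al., 2020)
}

for lang, name in TEACHERS.items():
    config = AutoConfig.from_pretrained(name)
    print(lang, name, config.num_hidden_layers, "layers,",
          config.num_attention_heads, "attention heads")
    # To actually instantiate a teacher:
    # model = transformers.AutoModel.from_pretrained(name)
```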
“…We implement the model with Transformers (Wolf et al., 2020). We train and distill the models with TextBrewer (Yang et al., 2020). All the experiments were performed with a single V100 GPU.…”
Section: Methods (mentioning)
confidence: 99%
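A hedged sketch of how a Transformers model could be wired into TextBrewer through an adaptor, including hidden states for intermediate-layer matching. The checkpoint name, layer pairing, and loss weights are illustrative assumptions rather than the cited paper's actual configuration.

```python
# Wiring a Hugging Face Transformers model into TextBrewer via an adaptor.
# Checkpoint and intermediate-match settings are illustrative assumptions.
from transformers import AutoModelForSequenceClassification
from textbrewer import DistillationConfig

teacher = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2, output_hidden_states=True
)

def hf_adaptor(batch, model_outputs):
    # Transformers models return a ModelOutput; expose the fields TextBrewer uses.
    return {
        "logits": model_outputs.logits,
        "hidden": model_outputs.hidden_states,   # needs output_hidden_states=True
        "inputs_mask": batch["attention_mask"],  # assumes the batch is a dict of tensors
    }

# Example intermediate-layer match: align student layer 3 with teacher layer 6 via MSE.
distill_config = DistillationConfig(
    temperature=4,
    intermediate_matches=[
        {"layer_T": 6, "layer_S": 3, "feature": "hidden", "loss": "hidden_mse", "weight": 1.0},
    ],
)
```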