Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations 2020
DOI: 10.18653/v1/2020.acl-demos.2
TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural Language Processing

Abstract: In this paper, we introduce TextBrewer, an open-source knowledge distillation toolkit designed for natural language processing. It works with different neural network models and supports various kinds of supervised learning tasks, such as text classification, reading comprehension, and sequence labeling. TextBrewer provides a simple and uniform workflow that enables quick setting up of distillation experiments with highly flexible configurations. It offers a set of predefined distillation methods and can be extended…
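To make the described workflow concrete, here is a minimal sketch of the configuration-plus-distiller pattern that TextBrewer's documentation describes (TrainingConfig, DistillationConfig, GeneralDistiller). The toy teacher and student models, the data, and the hyperparameter values are illustrative assumptions, not settings from the paper.

```python
# Minimal TextBrewer workflow sketch; the toy models and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from textbrewer import GeneralDistiller, TrainingConfig, DistillationConfig

# Toy stand-ins for a teacher and a smaller student; in practice these would be,
# e.g., a fine-tuned BERT teacher and a shallower Transformer student.
teacher = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
student = nn.Sequential(nn.Linear(32, 2))

def adaptor(batch, model_outputs):
    # Tell TextBrewer which part of the model output holds the logits.
    return {"logits": model_outputs}

dataloader = DataLoader(TensorDataset(torch.randn(128, 32)), batch_size=16)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

train_config = TrainingConfig(device="cpu")
distill_config = DistillationConfig(
    temperature=4,        # soften the output distributions
    kd_loss_type="ce",    # soft cross-entropy between teacher and student
    hard_label_weight=0,  # no supervised hard-label loss in this toy example
)

distiller = GeneralDistiller(
    train_config=train_config, distill_config=distill_config,
    model_T=teacher, model_S=student,
    adaptor_T=adaptor, adaptor_S=adaptor,
)

with distiller:
    distiller.train(optimizer, dataloader, num_epochs=1, callback=None)
```

The same three-object pattern (training config, distillation config, distiller) is what the "uniform workflow" in the abstract refers to; switching tasks or models mainly changes the adaptor and the configs.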

Cited by 16 publications (7 citation statements)
References 24 publications
“…However, during experiments, we still observe that the weight of the KD loss α has a large impact on the model performance, and better performance is achieved by setting α to be relatively small (e.g., 0.1 or 0.2). Our observations are consistent with the experimental observations of Yang et al. (2020) and Sun et al. (2019b). Intuitively, it seems that the KD objective is conflicting with the CE loss to a certain degree.…”
Section: Gradient Alignment (supporting)
confidence: 92%
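The quoted observation concerns the usual weighted combination of the distillation and supervised objectives, roughly L = α·L_KD + (1 - α)·L_CE. The sketch below is a generic illustration of that weighting; the temperature and the value of α are assumptions, not code from either cited paper.

```python
# Generic soft-label KD loss combined with cross-entropy, weighted by alpha.
# The values of alpha and T are illustrative, not taken from the cited papers.
import torch
import torch.nn.functional as F

def kd_ce_loss(student_logits, teacher_logits, labels, alpha=0.1, T=4.0):
    # Soft-label distillation term: KL divergence between temperature-scaled
    # distributions, scaled by T^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Standard supervised cross-entropy on the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    # A small alpha (e.g., 0.1 or 0.2) down-weights the KD term relative to CE,
    # matching the observation quoted above.
    return alpha * kd + (1 - alpha) * ce

# Toy usage
s = torch.randn(8, 3, requires_grad=True)
t = torch.randn(8, 3)
y = torch.randint(0, 3, (8,))
kd_ce_loss(s, t, y, alpha=0.1).backward()
```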
“…For future work, we will update TextPruner to support more pre-trained models, such as the generation model T5 (Raffel et al., 2020). We also plan to combine TextPruner with our previously released knowledge distillation toolkit TextBrewer (Yang et al., 2020) into a single framework to provide more effective model compression methods and a uniform interface for knowledge distillation and model pruning.…”
Section: Discussion (mentioning)
confidence: 99%
“…We take English, French and Chinese as the supervised languages. For the monolingual teachers, we use RoBERTa for English (Liu et al., 2019), CamemBERT for French (Martin et al., 2020) and RoBERTa-wwm-ext for Chinese (Cui et al., 2020). All of the teachers share the same structure, with 24 layers and 16 attention heads, which is largely the same as BERT-large, except that they do not use token type embeddings.…”
Section: Methods (mentioning)
confidence: 99%
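As an illustration of this teacher setup, the sketch below loads the configurations of plausible Hugging Face Hub checkpoints for the three monolingual teachers and prints their depth and head counts. The Hub identifiers are assumptions chosen to match the models named in the quote; the cited paper does not confirm these exact checkpoints here.

```python
# Inspect candidate monolingual teacher encoders with Transformers and check
# that they match the quoted structure (24 layers, 16 attention heads).
# The Hub identifiers below are assumptions, not confirmed by the cited paper.
from transformers import AutoConfig

TEACHERS = {
    "en": "roberta-large",                      # RoBERTa (Liu et al., 2019)
    "fr": "camembert/camembert-large",          # CamemBERT (Martin et al., 2020)
    "zh": "hfl/chinese-roberta-wwm-ext-large",  # RoBERTa-wwm-ext (Cui et al., 2020)
}

for lang, name in TEACHERS.items():
    config = AutoConfig.from_pretrained(name)
    print(lang, name, config.num_hidden_layers, "layers,",
          config.num_attention_heads, "attention heads")
    # To actually instantiate a teacher:
    # model = transformers.AutoModel.from_pretrained(name)
```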
“…We implement the model with Transformers (Wolf et al., 2020). We train and distill the models with TextBrewer (Yang et al., 2020). All the experiments were performed with a single V100 GPU.…”
Section: Methods (mentioning)
confidence: 99%
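A hedged sketch of how a Transformers model could be wired into TextBrewer through an adaptor, including hidden states for intermediate-layer matching. The checkpoint name, layer pairing, and loss weights are illustrative assumptions rather than the cited paper's actual configuration.

```python
# Wiring a Hugging Face Transformers model into TextBrewer via an adaptor.
# Checkpoint and intermediate-match settings are illustrative assumptions.
from transformers import AutoModelForSequenceClassification
from textbrewer import DistillationConfig

teacher = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2, output_hidden_states=True
)

def hf_adaptor(batch, model_outputs):
    # Transformers models return a ModelOutput; expose the fields TextBrewer uses.
    return {
        "logits": model_outputs.logits,
        "hidden": model_outputs.hidden_states,   # needs output_hidden_states=True
        "inputs_mask": batch["attention_mask"],  # assumes the batch is a dict of tensors
    }

# Example intermediate-layer match: align student layer 3 with teacher layer 6 via MSE.
distill_config = DistillationConfig(
    temperature=4,
    intermediate_matches=[
        {"layer_T": 6, "layer_S": 3, "feature": "hidden", "loss": "hidden_mse", "weight": 1.0},
    ],
)
```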