2021
DOI: 10.1609/aaai.v35i15.17610

ALP-KD: Attention-Based Layer Projection for Knowledge Distillation

Abstract: Knowledge distillation is considered as a training and compression strategy in which two neural networks, namely a teacher and a student, are coupled together during training. The teacher network is supposed to be a trustworthy predictor and the student tries to mimic its predictions. Usually, a student with a lighter architecture is selected so we can achieve compression and yet deliver high-quality results. In such a setting, distillation only happens for final predictions whereas the student could also bene…

Cited by 64 publications (24 citation statements)
References 17 publications
“…The original knowledge is the original large DL model, which is referred to as the teacher model. The knowledge distillation algorithm is used to transfer knowledge from the teacher model to the smaller student model using techniques such as Adversarial KD [130,131], Multi-Teacher KD [132,133,134], Cross-modal KD [135,136], Attention-based KD [137,138,139,140], Lifelong KD [141,142] and Quantized KD [143,144]. Finally, the teacher-student architecture is used to train the student model.…”
Section: C) Knowledge Distillation
mentioning confidence: 99%
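As a rough illustration of the teacher-to-student transfer described in the statement above, here is a minimal sketch of logit-level distillation. The temperature, loss weighting, and model interfaces are placeholder assumptions for illustration, not details taken from any of the cited works.

```python
# Minimal teacher-student distillation loss (hypothetical settings, PyTorch).
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a KL term on temperature-softened logits."""
    # Hard-label loss on the downstream task.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label loss: the student mimics the teacher's output distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kd
```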
“…Recently, there have been several breakthroughs [10,11,12,13] related to the compression of BERT models in the pre-training stage, which is also called task-agnostic distillation [13]. To avoid re-building a pre-trained language model, researchers [14,15] are seeking an alternative that can directly distill knowledge from a teacher model for a downstream task, known as task-specific distillation [13]. In this way, given a downstream task, the teacher is the BERT model that was fine-tuned on the task, and the goal of the student model is to mimic the outputs of the teacher during the given task.…”
Section: Introduction
mentioning confidence: 99%
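A sketch of the task-specific setting the statement describes: the teacher is a model already fine-tuned on the downstream task and kept frozen, and the student is trained to mimic its outputs on that task. The HuggingFace-style `.logits` interface and the names `teacher`, `student`, and `batch` are assumptions for illustration.

```python
# Hypothetical task-specific distillation step with a frozen, fine-tuned teacher.
import torch
import torch.nn.functional as F

def task_specific_step(teacher, student, optimizer, batch, T=2.0):
    teacher.eval()
    with torch.no_grad():                                  # frozen, fine-tuned teacher
        t_logits = teacher(batch["input_ids"]).logits
    s_logits = student(batch["input_ids"]).logits
    # The student mimics the teacher's softened output distribution on the task.
    loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```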
“…To fix this problem in PKD [14], Passban et al. [15] proposed Attention-Based Layer Projection for Knowledge Distillation (ALP-KD), which, instead of skipping some teacher layers, optimizes the student model against all layers of the teacher model. However, each layer in BERT [1] plays a role in the NLP pipeline [16].…”
Section: Introduction
mentioning confidence: 99%
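To make the attention-based layer projection idea concrete, the following is a minimal sketch of how a single student layer could attend over all teacher layers and be pulled toward the resulting combination. The dot-product scoring, tensor shapes, and MSE objective are my assumptions for illustration, not the authors' exact formulation.

```python
# Sketch of attention-based layer projection (hypothetical shapes, PyTorch).
import torch
import torch.nn.functional as F

def alp_kd_loss(student_hidden, teacher_hiddens):
    """
    student_hidden:  (batch, seq, dim)                 one student layer's output
    teacher_hiddens: (layers, batch, seq, dim)         all teacher layers' outputs
    """
    # Score each teacher layer against the student layer (scaled dot product over dim).
    scores = torch.einsum("bsd,lbsd->lbs", student_hidden, teacher_hiddens)
    scores = scores / student_hidden.size(-1) ** 0.5
    weights = F.softmax(scores, dim=0)                  # attention over teacher layers
    # Attention-weighted projection of all teacher layers, so no layer is skipped.
    projected = torch.einsum("lbs,lbsd->bsd", weights, teacher_hiddens)
    # Pull the student layer toward the projected teacher representation.
    return F.mse_loss(student_hidden, projected)
```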