2020
DOI: 10.1609/aaai.v34i04.5963

Improved Knowledge Distillation via Teacher Assistant

Abstract: Despite the fact that deep neural networks are powerful models and achieve appealing results on many tasks, they are too large to be deployed on edge devices like smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, where a large (teacher) pre-trained network is used to train a smaller (student) network. However, in this paper, we show that the student network performance degrades when the gap between student and teacher is la…
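The abstract describes standard teacher–student knowledge distillation. As a rough illustration only (not code from the paper), here is a minimal PyTorch sketch of the usual distillation loss; the temperature T, weighting alpha, and function name are hypothetical choices, not values taken from the work.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft-target term: KL divergence between temperature-softened
    # student and teacher distributions, scaled by T^2 as is conventional.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```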

Cited by 700 publications (292 citation statements)
References 10 publications (16 reference statements)
“…When it comes to the effects of knowledge definition, the structural differences between the networks are very important. [25] also finds that networks with similar structures are easier to transfer knowledge. Therefore, in this paper, the knowledge distillation between the sub-networks of the original network is used to minimize the structural differences.…”
Section: Knowledge Distillation
confidence: 90%
“…To alleviate the capacity mismatching problem, Ref. [ 35 ] introduces multi-step KD, which uses an intermediate-sized model (teacher assistant) to bridge the gap between the student and teacher. Route Constrained Optimization (RCO) [ 36 ] supervises the student model with some anchor points selected from the route in parameter space that the teacher pass by, instead of the converged teacher model.…”
Section: Related Work
confidence: 99%
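The statement above summarizes the paper's multi-step (teacher-assistant) scheme: an intermediate-sized model is distilled from the teacher and then serves as the teacher for the small student. A hedged sketch of how such a chain could be wired up is below; the models `teacher`, `assistant`, `student`, the `train_loader`, and all hyperparameters are hypothetical placeholders, not objects from the paper's code.

```python
import torch
import torch.nn.functional as F

def distill(student, teacher, loader, epochs=1, lr=0.01, T=4.0, alpha=0.9):
    # One distillation step: train `student` to mimic a frozen `teacher`.
    teacher.eval()
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)
            # Softened KL term plus hard-label cross-entropy, as in the sketch above.
            soft = F.kl_div(
                F.log_softmax(s_logits / T, dim=1),
                F.softmax(t_logits / T, dim=1),
                reduction="batchmean",
            ) * (T * T)
            loss = alpha * soft + (1.0 - alpha) * F.cross_entropy(s_logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student

# TAKD-style chain: large teacher -> intermediate teacher assistant -> small student.
assistant = distill(assistant, teacher, train_loader)
student = distill(student, assistant, train_loader)
```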
“…In 2019, Mirzadeh et al [53] demonstrated that any desired teacher network is only capable of distilling knowledge to a student model with a specific threshold of parameters. If the size of the model is less than that threshold, the KD procedure may not be compelling.…”
Section: B. Multi-Teacher KD
confidence: 99%