2019
DOI: 10.48550/arxiv.1902.03393
Preprint

Improved Knowledge Distillation via Teacher Assistant

Abstract: Despite the fact that deep neural networks are powerful models and achieve appealing results on many tasks, they are too large to be deployed on edge devices like smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, where a large (teacher) pre-trained network is used to train a smaller (student) network. However, in this paper, we show that the student network performance degrades when the gap between student and teacher is large…
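The distillation the abstract refers to is the standard soft-target formulation of Hinton et al.; as a point of reference, a minimal sketch of that loss is given below, assuming an ordinary classification setting in PyTorch. The name kd_loss and the default temperature T and mixing weight alpha are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style knowledge distillation loss (illustrative sketch).

    Mixes a soft-target term (KL divergence between temperature-softened
    teacher and student distributions) with the usual cross entropy on the
    hard labels. T and alpha are hypothetical defaults.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

The paper's observation is that, for a fixed student, making the teacher in this loss ever larger eventually hurts rather than helps, which is what the teacher assistant is meant to address.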

Cited by 8 publications (15 citation statements)
References 0 publications
“…Yu et al. [100] proposed two new loss functions to model the communication between the deep teacher network and the small student network: one based on the absolute teacher and the other on the relative teacher. Mirzadeh et al. [101] introduced a multi-step knowledge distillation technique that uses a medium-sized network (teacher assistant) to bridge the gap between the student and the teacher.…”
Section: A. Knowledge From Logits
confidence: 99%
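The multi-step scheme described in this snippet amounts to chaining two ordinary distillation runs: the teacher is first distilled into a medium-sized teacher assistant, and the assistant is then distilled into the final student. A minimal sketch follows, reusing the kd_loss function sketched above; train_with_kd and make_net are hypothetical helpers introduced only for illustration.

```python
import torch

def train_with_kd(student, teacher, loader, epochs=1, lr=0.01, T=4.0, alpha=0.9):
    """Distill a frozen teacher into a student over one data loader (illustrative)."""
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    teacher.eval()
    student.train()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                teacher_logits = teacher(x)  # teacher provides soft targets only
            loss = kd_loss(student(x), teacher_logits, y, T=T, alpha=alpha)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student

# Hypothetical capacity chain (make_net is a placeholder model factory):
# distill a large teacher into a mid-sized assistant, then the assistant
# into the small student.
#   assistant = train_with_kd(make_net(depth=6), teacher, loader)
#   student   = train_with_kd(make_net(depth=2), assistant, loader)
```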
“…Method | Knowledge from | Details
Hinton et al. [97] | Logits | Cross entropy
Huang et al. [99] | Logits | Cross entropy and maximum mean discrepancy
Yu et al. [100] | Logits | Hints and attention
Mirzadeh et al. [101] | Logits | Use of a teacher assistant
Romero et al. [102] | Intermediate layers | MSE loss at a certain middle layer
Yim et al. [103] | Intermediate layers | Gram matrix loss at multiple middle layers
Zagoruyko et al. [104] | Intermediate layers | Attention transfer loss at multiple middle layers
Zhang et al. [105] | Intermediate layers | Adaptive selection of a middle layer
Peng et al. [106] | Mutual information | Correlation between multiple instances
Crowley et al. [107] | Self structures | Same structure, with cheap convolution blocks
Park et al. [108] | Structured knowledge | Relational potential function to transfer the information
Lopez-Paz et al. [109] | Privileged information | Pair-wise and holistic distillation between two neural networks; an information theory framework for knowledge transfer
Tung et al. [112] proposed a new form of KD loss, inspired by the observation that similar inputs produce similar activation patterns in a well-trained network.…”
Section: References
confidence: 99%
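As a concrete instance of the intermediate-layer entries in the table above, a FitNets-style hint loss matches one student feature map to one teacher feature map through a small regressor. The sketch below is an assumption-laden illustration: the class name HintLoss, the 1x1 convolutional regressor, and the matching spatial sizes are choices made here, not details from the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """FitNets-style hint loss on a single intermediate layer (illustrative).

    A 1x1 convolution maps the student's feature map to the teacher's channel
    count before taking the MSE; the two feature maps are assumed to share
    the same spatial resolution.
    """
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        return F.mse_loss(self.regressor(student_feat), teacher_feat)
```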
“…In [33], the capacity gap between the large teacher model and the student has been investigated. It shows that the relationship between the architectures of the teacher and student models is very important.…”
Section: A. Image Classification
confidence: 99%
“…Deep neural network (DNN)-driven algorithms now stand as the state of the art in a variety of domains, from perceptual tasks such as computer vision, speech and language processing to, more recently, control tasks such as robotics (Mirzadeh et al 2019), (Bastani, Pu, and Solar-Lezama 2018). Nevertheless, there is often reason to avoid direct use of DNNs.…”
Section: Introduction
confidence: 99%
“…For example, the training from scratch or hyperparameter tuning of such networks can be prohibitively expensive or time-consuming (Schmitt et al. 2018). For some applications, the size or complexity of such DNNs precludes their use in real time, or deployment on edge devices with limited processing resources (Chen et al. 2017), (Mirzadeh et al. 2019). In other areas, such as flight control or self-driving cars, DNNs are sidelined (at least for mass deployment) by their opaqueness or lack of decision-making interpretability (Bastani, Kim, and Bastani 2017), (Hind et al. 2019).…”
Section: Introduction
confidence: 99%