2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01052

Refine Myself by Teaching Myself: Feature Refinement via Self-Knowledge Distillation

Cited by 119 publications (58 citation statements)
References 23 publications
“…Knowledge distillation was first proposed by Hinton et al. (2015) to compress a model by transferring knowledge from a cumbersome teacher network to a compact student network. Recently, knowledge distillation has also been applied to model enhancement through improved learning strategies, including self-learning (Ji et al., 2021; Zheng and Peng, 2022) and mutual learning (Li et al., 2021). For example, Hong et al. (2020) apply knowledge distillation to heterogeneous task imitation and guide the student network's training using features extracted from an image reconstruction task.…”
Section: Related Work
confidence: 99%
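The teacher-student transfer described in this excerpt boils down to a single loss term. The sketch below is a generic illustration of Hinton-style distillation, not code from any of the cited works; the function name, temperature T, and weight alpha are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style distillation: soften both logit sets with temperature T,
    match them with KL divergence, and mix in the usual cross-entropy."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # T^2 rescales the soft term so its gradients stay comparable to the hard term.
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In a two-stage pipeline the teacher logits come from a frozen, pretrained network; the self-distillation variants discussed below replace that teacher with signals produced by the student itself.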
“…(1) KD first requires training a large DNN as the teacher; (2) when training the student, KD needs to process each sample twice in each iteration, once by the teacher and once by the student. To reduce the cost of training a large teacher, many self-distillation approaches, including but not limited to (Xu and Liu 2019; Zhang et al. 2019; Yang et al. 2019b,a; Furlanello et al. 2018; Bagherinezhad et al. 2018; Yun et al. 2020; Deng and Zhang 2021c; Ji et al. 2021), have been proposed. Zhang et al. (2019) and Ji et al. (2021) add additional layers or parameters to a DNN to generate soft labels, which improves performance but introduces a large computation and memory cost. Born-again networks (Furlanello et al. 2018; Yang et al. 2019a) and label-refinery networks (Bagherinezhad et al. 2018) train a DNN for many generations, using the network from the (i − 1)th generation as the teacher for the network in the ith generation.…”
Section: Related Work
confidence: 99%
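The generation-wise scheme mentioned in this excerpt can be sketched as a short training loop. The snippet below is a minimal illustration, assuming a generic `make_model()` constructor, the `kd_loss` helper from the previous sketch, and placeholder hyperparameters; none of these names come from the cited papers.

```python
import copy
import torch

def born_again_training(make_model, train_loader, num_generations=3, epochs=10, device="cpu"):
    """Train a sequence of identical networks; generation i is supervised by
    the frozen generation i-1 in addition to the ground-truth labels."""
    teacher = None
    for gen in range(num_generations):
        student = make_model().to(device)
        optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
        for _ in range(epochs):
            for images, labels in train_loader:
                images, labels = images.to(device), labels.to(device)
                logits = student(images)
                if teacher is None:
                    # First generation: plain supervised training, no teacher yet.
                    loss = torch.nn.functional.cross_entropy(logits, labels)
                else:
                    with torch.no_grad():
                        teacher_logits = teacher(images)
                    loss = kd_loss(logits, teacher_logits, labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        # The finished student becomes the frozen teacher of the next generation.
        teacher = copy.deepcopy(student).eval()
        for p in teacher.parameters():
            p.requires_grad_(False)
    return teacher
```

Note the trade-off the excerpt points out: this avoids a separate large teacher, but each sample is still processed twice per iteration once a previous generation exists.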
“…There are two categories of self-distillation techniques, namely auxiliary-parameter methods [2,8,13,30,33] and contrastive-sample methods [14,27,29,31]. Auxiliary-parameter methods exploit additional branches to obtain extra predictions, besides the main-branch prediction, for soft-label supervision, at the cost of additional parameter overhead.…”
Section: Self-Distillation
confidence: 99%
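To make the auxiliary-branch idea concrete, here is a rough sketch in which an extra classifier is attached to an intermediate feature map and trained against the (detached) soft output of the main branch. The backbone split, layer sizes, and loss weights are placeholder assumptions, not the architecture of any cited method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryBranchNet(nn.Module):
    """Backbone with a main classifier plus one auxiliary classifier hung off
    a shallower feature map (placeholder sizes for CIFAR-like inputs)."""
    def __init__(self, num_classes=100):
        super().__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.main_head = nn.Linear(128, num_classes)
        # Auxiliary head taps the intermediate feature map.
        self.aux_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(64, num_classes))

    def forward(self, x):
        feat1 = self.stage1(x)
        feat2 = self.stage2(feat1).flatten(1)
        return self.main_head(feat2), self.aux_head(feat1)

def self_distill_loss(main_logits, aux_logits, labels, T=3.0):
    # Both branches see the hard labels; the auxiliary branch additionally
    # mimics the detached main-branch distribution, so no external teacher is needed.
    ce = F.cross_entropy(main_logits, labels) + F.cross_entropy(aux_logits, labels)
    kd = F.kl_div(F.log_softmax(aux_logits / T, dim=1),
                  F.softmax(main_logits.detach() / T, dim=1),
                  reduction="batchmean") * (T * T)
    return ce + kd
```

The "parameter overhead" the excerpt mentions is exactly the auxiliary head: it is used only during training and can be dropped at inference time.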
“…Self-distillation simplifies the two-stage knowledge distillation framework by distilling knowledge from the network itself instead of from a pretrained teacher, and still … a) … [7,12]. b) Auxiliary parameters [8,13,33]. c) Progressive distillation with a memory bank storing past predictions of the entire dataset [2,14,31].…”
Section: Introduction
confidence: 99%
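A hedged sketch of the memory-bank variant named in point c): predictions made during the previous epoch are cached per sample index and reused as soft targets in the current epoch. It assumes the data loader yields sample indices alongside images and labels; the function name and hyperparameters are illustrative, not from the cited works.

```python
import torch
import torch.nn.functional as F

def train_with_prediction_bank(model, train_loader, num_samples, num_classes,
                               epochs=10, T=2.0, alpha=0.5, device="cpu"):
    """Progressive self-distillation with a memory bank: soft predictions
    stored during epoch e-1 act as the teacher during epoch e."""
    bank = torch.zeros(num_samples, num_classes, device=device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    model.to(device)
    for epoch in range(epochs):
        for images, labels, idx in train_loader:  # loader must yield sample indices
            images, labels = images.to(device), labels.to(device)
            idx = idx.to(bank.device)
            logits = model(images)
            loss = F.cross_entropy(logits, labels)
            if epoch > 0:
                # Distill from the predictions this same model made one epoch ago.
                past = bank[idx]
                loss = (1 - alpha) * loss + alpha * F.kl_div(
                    F.log_softmax(logits / T, dim=1), past,
                    reduction="batchmean") * (T * T)
            # Refresh the bank with the current detached, softened predictions.
            bank[idx] = F.softmax(logits.detach() / T, dim=1)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```

The bank costs O(num_samples × num_classes) memory for the whole dataset, which is the storage overhead this family of methods trades for avoiding both a pretrained teacher and extra network branches.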