2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01164
Self-Distillation from the Last Mini-Batch for Consistency Regularization

Cited by 51 publications (24 citation statements). References 25 publications.
“…The label refurbishment in Section 3.4.2 only utilizes discrete hard labels, overlooking the rich information in the continuous soft distributions. Here, we record the exponential moving average (EMA) of historical logits to impose temporal consistency regularization [32, 33].…”
Section: Our Methods (mentioning)
confidence: 99%
“…As for theoretical analysis, there are various explanations, including label smoothing regularization [29], the multi-view hypothesis [30], and loss landscape flattening [31]. Similar to our method, PS-KD [32] trained a model with soft targets formed as a weighted summation of the hard targets and the last-epoch predictions, and DLB [33] used predictions from the last iteration as soft targets. However, we considered the entire prediction history and maintained an exponential moving average of the predictions.…”
Section: Introduction (mentioning)
confidence: 99%
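The contrast drawn above can be made concrete with a short sketch, assuming illustrative names (`prev_epoch_probs`, `ema_probs`, `alpha_t`); this is not the official code of any cited method, only the two target constructions as described.

```python
import torch
import torch.nn.functional as F

def pskd_target(y, prev_epoch_probs, num_classes, alpha_t):
    """PS-KD style: weighted sum of the one-hot label and the last-epoch prediction."""
    one_hot = F.one_hot(y, num_classes).float()
    return (1.0 - alpha_t) * one_hot + alpha_t * prev_epoch_probs

def ema_target(ema_probs, new_probs, alpha=0.9):
    """EMA over the entire prediction history, as the citing paper describes."""
    return alpha * ema_probs + (1.0 - alpha) * new_probs.detach()
```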
“…Additionally, α represents the momentum term, which determines the proportion of historical predictions participating in the t-th iteration. Shen et al. [138] use half of each mini-batch for extracting smoothed labels from previous iterations and the other half for providing soft targets for self-regularization.…”
Section: Aspect of Historical Type (mentioning)
confidence: 99%
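The mechanism summarized above (half of each mini-batch re-used from the last iteration, supervised by the smoothed predictions it received there) can be sketched as follows. This is a simplified illustration of the last-mini-batch self-distillation idea, not the authors' implementation; `carry`, `T`, and `lambda_dlb` are assumed names.

```python
import torch
import torch.nn.functional as F

def dlb_step(model, x_new, y_new, carry, T=3.0, lambda_dlb=1.0):
    """carry = (x_half, y_half, soft_half) saved from the previous iteration."""
    if carry is None:
        x, y = x_new, y_new
    else:
        x_half, y_half, soft_half = carry
        x = torch.cat([x_half, x_new], dim=0)
        y = torch.cat([y_half, y_new], dim=0)

    logits = model(x)
    loss = F.cross_entropy(logits, y)

    if carry is not None:
        # Self-distillation: match the repeated half against the smoothed
        # predictions it produced in the last iteration.
        n = x_half.size(0)
        log_p = F.log_softmax(logits[:n] / T, dim=1)
        loss = loss + lambda_dlb * F.kl_div(log_p, soft_half,
                                            reduction="batchmean") * (T * T)

    # Keep the last half of the fresh samples, plus their smoothed predictions,
    # to act as soft targets in the next iteration.
    half = x_new.size(0) // 2
    next_soft = F.softmax(logits[-half:].detach() / T, dim=1)
    return loss, (x_new[-half:], y_new[-half:], next_soft)
```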
“…(4) Probability predictions. The recorded probability predictions of specific instances are adopted for ensembling [132]–[135] and self-teaching [136]–[138]. (5) Loss values.…”
Section: Introduction (mentioning)
confidence: 99%
“…Recently, there have been many knowledge distillation mechanisms (Gou et al., 2021) that vary in either the distillation mechanism (e.g., offline distillation, online distillation, and self-distillation) or the type of knowledge distilled (e.g., logits, feature knowledge, and relational knowledge). Specifically, offline distillation utilises a two-stage approach that requires the teacher model to be pre-trained as a prior and fixed for knowledge transfer (Liu et al., 2021; Zhao et al., 2022; Mirzadeh et al., 2020; Wu et al., 2020), while online and self-distillation instead argue that a pre-trained large teacher model is not always available for the tasks of interest and thus propose to update both the teacher model and the student model simultaneously in an end-to-end manner (Yuan et al., 2020; Mobahi et al., 2020; Zhao et al., 2021; Xu et al., 2022; Shen et al., 2022; Gou et al., 2022; Li et al., 2018; Wu & Gong, 2021; Zhang et al., 2018).…”
Section: Introduction (mentioning)
confidence: 99%
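The distinction above between offline and online/self-distillation can be seen in a generic temperature-scaled logit-distillation loss; this is a sketch of the standard formulation rather than any specific cited method, with `T` and `beta` as assumed hyperparameters. In offline distillation `teacher_logits` come from a frozen pre-trained teacher, while online and self-distillation replace them with a simultaneously trained peer or the student's own earlier outputs.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, beta=0.5):
    # Soft term: KL between temperature-softened student and teacher distributions.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits.detach() / T, dim=1),
                    reduction="batchmean") * (T * T)
    # Hard term: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return beta * soft + (1.0 - beta) * hard
```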