2022
DOI: 10.48550/arxiv.2210.01213
Preprint

Robust Active Distillation

Abstract: Distilling knowledge from a large teacher model to a lightweight one is a widely successful approach for generating compact, powerful models in the semi-supervised learning setting where a limited amount of labeled data is available. In large-scale applications, however, the teacher tends to provide a large number of incorrect soft-labels that impairs student performance. The sheer size of the teacher additionally constrains the number of soft-labels that can be queried due to prohibitive computational and/or …

Cited by 1 publication (1 citation statement)
References 23 publications
“…where ℓ is some loss function (usually, a regularized cross-entropy loss for classification problems), y_T is the teacher's predicted label, y is the given label on which the teacher is trained, y_S(θ) is the prediction of the student model parameterized by θ, and ξ ∈ [0, 1] is known as the imitation parameter [Lopez-Paz et al., 2015]. KD and its variants have been shown to be beneficial for model compression (i.e., distilling a bigger teacher model's knowledge into a smaller student model), semi-supervised learning, making models robust, and improving performance in general [Li et al., 2017, Furlanello et al., 2018, Sun et al., 2019, Ahn et al., 2019, Xie et al., 2020, Sarfraz et al., 2021, Li et al., 2021, Pham et al., 2021, Beyer et al., 2022, Baykal et al., 2022]; see [Gou et al., 2021] for a survey on KD. The focus of this work is on the special case of the student and teacher having the same architecture, which is known as self-distillation (following [Mobahi et al., 2020]); we abbreviate it as SD henceforth.…”
Section: Introduction (mentioning)
confidence: 99%
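The quoted passage describes the distillation objective only through its "where …" clause. A minimal sketch of the objective that clause appears to refer to, following the standard imitation-parameter formulation of Lopez-Paz et al., 2015 (the exact regularizer and any temperature scaling used in the cited works are not shown in the excerpt and are omitted here as assumptions):

\[
\mathcal{L}_{\mathrm{KD}}(\theta) \;=\; \xi\,\ell\big(y_S(\theta),\, y_T\big) \;+\; (1-\xi)\,\ell\big(y_S(\theta),\, y\big), \qquad \xi \in [0,1],
\]

where ℓ is the loss function named in the quote (typically a regularized cross-entropy for classification). Setting ξ = 0 recovers ordinary supervised training on the given labels y, while ξ = 1 trains the student purely to imitate the teacher's predictions y_T.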