“…where ℓ(·, ·) is some loss function (usually, regularized cross-entropy loss for classification problems), y_T is the teacher's predicted label, y is the given label on which the teacher is trained, y_S(θ) is the prediction of the student model parameterized by θ, and ξ ∈ [0, 1] is known as the imitation parameter [Lopez-Paz et al., 2015]. KD and its variants have been shown to be beneficial for model compression (i.e., distilling a bigger teacher model's knowledge into a smaller student model), semi-supervised learning, making models robust, and improving performance in general [Li et al., 2017, Furlanello et al., 2018, Sun et al., 2019, Ahn et al., 2019, Xie et al., 2020, Sarfraz et al., 2021, Li et al., 2021, Pham et al., 2021, Beyer et al., 2022, Baykal et al., 2022]; see [Gou et al., 2021] for a survey on KD. The focus of this work is on the special case of the student and teacher having the same architecture, which is known as self-distillation (following [Mobahi et al., 2020]); we abbreviate it as SD henceforth.…”
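To make the role of the imitation parameter ξ concrete, here is a minimal PyTorch sketch of an objective of this form. The function name `imitation_loss`, the use of plain (temperature-free) cross-entropy for both terms, the omission of any regularizer, and the convention that ξ weights the distillation term are assumptions made for illustration, not the paper's exact equation.

```python
import torch
import torch.nn.functional as F

def imitation_loss(student_logits, teacher_logits, labels, xi=0.5):
    """xi-weighted combination of a distillation term and a supervised term.

    Illustrative sketch: xi = 1 trains the student only against the
    teacher's predictions y_T; xi = 0 recovers ordinary training on the
    given labels y. (Which term xi weights may differ in the paper.)
    """
    # Supervised term: cross-entropy of the student prediction y_S(theta)
    # against the given hard labels y.
    ce_hard = F.cross_entropy(student_logits, labels)

    # Distillation term: cross-entropy of the student prediction against
    # the teacher's predictive distribution y_T (soft labels).
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    log_student = F.log_softmax(student_logits, dim=-1)
    ce_soft = -(teacher_probs * log_student).sum(dim=-1).mean()

    return xi * ce_soft + (1.0 - xi) * ce_hard
```

In the self-distillation (SD) setting discussed here, `teacher_logits` would simply come from a previously trained model with the same architecture as the student.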