“…where ℓ(·, ·) is some loss function (usually, regularized cross-entropy loss for classification problems), y_T is the teacher's predicted label, y is the given label on which the teacher is trained, y_S(θ) is the prediction of the student model parameterized by θ, and ξ ∈ [0, 1] is known as the imitation parameter [Lopez-Paz et al., 2015]. KD and its variants have been shown to be beneficial for model compression (i.e., distilling a bigger teacher model's knowledge into a smaller student model), semi-supervised learning, making models robust, and improving performance in general [Li et al., 2017, Furlanello et al., 2018, Sun et al., 2019, Ahn et al., 2019, Xie et al., 2020, Sarfraz et al., 2021, Li et al., 2021, Pham et al., 2021, Beyer et al., 2022, Baykal et al., 2022]; see [Gou et al., 2021] for a survey on KD. The focus of this work is on the special case of the student and teacher having the same architecture, which is known as self-distillation (following [Mobahi et al., 2020]); we abbreviate it as SD henceforth.…”
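To make the role of the imitation parameter ξ concrete, here is a minimal PyTorch sketch of an objective of this form. The function name `imitation_loss`, the use of plain (temperature-free) cross-entropy for both terms, the omission of any regularizer, and the convention that ξ weights the distillation term are assumptions made for illustration, not the paper's exact equation.

```python
import torch
import torch.nn.functional as F

def imitation_loss(student_logits, teacher_logits, labels, xi=0.5):
    """xi-weighted combination of a distillation term and a supervised term.

    Illustrative sketch: xi = 1 trains the student only against the
    teacher's predictions y_T; xi = 0 recovers ordinary training on the
    given labels y. (Which term xi weights may differ in the paper.)
    """
    # Supervised term: cross-entropy of the student prediction y_S(theta)
    # against the given hard labels y.
    ce_hard = F.cross_entropy(student_logits, labels)

    # Distillation term: cross-entropy of the student prediction against
    # the teacher's predictive distribution y_T (soft labels).
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    log_student = F.log_softmax(student_logits, dim=-1)
    ce_soft = -(teacher_probs * log_student).sum(dim=-1).mean()

    return xi * ce_soft + (1.0 - xi) * ce_hard
```

In the self-distillation (SD) setting discussed here, `teacher_logits` would simply come from a previously trained model with the same architecture as the student.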