2020
DOI: 10.48550/arxiv.2002.09437
Preprint
Calibrating Deep Neural Networks using Focal Loss

Abstract: Miscalibration, a mismatch between a model's confidence and its correctness, of Deep Neural Networks (DNNs) makes their predictions hard to rely on. Ideally, we want networks to be accurate, calibrated and confident. We show that, as opposed to the standard cross-entropy loss, focal loss [Lin et al., 2017] allows us to learn models that are already very well calibrated. When combined with temperature scaling, whilst preserving accuracy, it yields state-of-the-art calibrated models. We provide a thorough analys…
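The focal loss referenced in the abstract reshapes cross-entropy so that examples the model already classifies confidently contribute less to the loss, which is what discourages the over-confident fits behind miscalibration. A minimal PyTorch sketch, assuming a multi-class setup with integer labels; the function name and the fixed gamma=3.0 are illustrative choices, not the authors' reference implementation:

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=3.0):
    # Focal loss: FL(p_t) = -(1 - p_t)^gamma * log(p_t); gamma = 0 recovers
    # ordinary cross-entropy.
    # logits: (N, C) raw class scores, targets: (N,) integer class labels.
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample -log(p_t)
    p_t = torch.exp(-ce)                                     # probability of the true class
    return (((1.0 - p_t) ** gamma) * ce).mean()              # down-weight easy examples

Temperature scaling, mentioned alongside it, is a post-hoc step: a single scalar T is fitted on a validation set and the trained model's logits are divided by T before the softmax, which changes confidence but not accuracy.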

Cited by 18 publications (29 citation statements) | References 14 publications
“…As explained earlier, this problem occurs due to two main reasons: overfitting in the final softmax layer and subdued tail-class activations. The overfitting and the resulting miscalibration in softmax probabilities is a known phenomenon observed in many modern multi-class classification networks [12,20]. This has been attributed to prolonged training to minimize the negative log-likelihood loss on a network with a large capacity.…”
Section: Our Methods
confidence: 99%
“…To mitigate these problems, the overfit softmax layer of the model f_θ is replaced with a structurally similar newly-initialized layer (recalibration layer). The intuition behind replacing only the last layer is based on the idea that overfitting and miscalibration are mainly attributed to weight magnification, particularly in the last layer of the neural network [20]. The new layer is trained with early stopping and focal loss, which help solve the problems of overfitting and subdued activations, respectively.…”
Section: Our Methods
confidence: 99%
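A rough sketch of the recalibration step quoted above: the trained classifier head is swapped for a newly initialized layer of the same shape, and only that layer is then retrained with the focal loss sketched earlier plus early stopping. The attribute name model.fc and the choice to freeze the backbone during this phase are assumptions for illustration, not details confirmed by the citing paper:

import torch.nn as nn
import torchvision.models as models

def rebuild_classifier_head(model, num_classes):
    # Freeze the backbone so only the new head is updated (an assumption here),
    # then replace the possibly overfit final linear layer with a freshly
    # initialized "recalibration layer" of the same shape.
    for p in model.parameters():
        p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new params are trainable by default
    return model

# Example: swap the head of a ResNet-18, then retrain model.fc with focal_loss
# and stop early once validation loss or calibration stops improving.
model = rebuild_classifier_head(models.resnet18(num_classes=10), num_classes=10)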
“…However, this raises a new concern: to what extent are the network's predictions likely to be correct? As these deep networks try to reduce the negative log-likelihood loss, they overfit to the dataset, rendering their predictions over-confident and less trustworthy (Mukhoti et al., 2020). Here, the network is termed poorly calibrated.…”
Section: Introduction
confidence: 99%
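Poor calibration in this sense is usually quantified with the Expected Calibration Error (ECE): predictions are binned by confidence, and the gap between average confidence and accuracy in each bin is averaged, weighted by bin size. A small NumPy sketch; the helper name and the 15 equal-width bins are common conventions rather than anything prescribed by the quoted paper:

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    # confidences: (N,) probability assigned to the predicted class.
    # correct:     (N,) 1 if that prediction was right, else 0.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight each bin by its share of samples
    return ece

# An over-confident model: 90% average confidence but only 60% accuracy
# yields an ECE of about 0.3.
print(expected_calibration_error([0.9] * 10, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))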