“…Unfortunately, during this near-decade of DL advancement, the DL community's fixation on ever-deeper and more complicated architectures, as well as on traditional predictive-performance evaluation measures, has led to an insidious model behavior: overconfident predictions,30 i.e., predictions made with a probability nearing 1 regardless of whether they are correct or not. Downstream software modules or policy makers making catastrophic decisions because of these overconfidently predicted misclassifications can foster deep mistrust in DL,30,31,32 something that has also been noted with respect to bioacoustics.3 However, early prediction calibration fixes30 are based on learning a transformation of the model outputs, which requires the existence of a labeled validation set, something that cannot be safely assumed in general.…”
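
To make the final point concrete, a minimal sketch of one such early output-transformation calibrator, temperature scaling, is given below. It learns a single scalar T > 0 by minimizing negative log-likelihood on held-out labeled data; the function and variable names here are illustrative, and the `val_logits`/`val_labels` arguments are precisely the labeled validation set that the passage notes cannot be safely assumed to exist.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels):
    """Learn a scalar T > 0 that rescales logits to minimize negative
    log-likelihood on a *labeled* validation set -- the requirement the
    surrounding text flags as unrealistic in general.
    val_logits: (n_val, n_classes) raw model outputs
    val_labels: (n_val,) integer class labels
    """
    def nll(T):
        probs = softmax(val_logits / T)
        # mean negative log-probability of the true class
        return -np.log(probs[np.arange(len(val_labels)), val_labels] + 1e-12).mean()
    res = minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded")
    return res.x

# Usage sketch: calibrate confidence on new data.
# T = fit_temperature(val_logits, val_labels)
# calibrated_probs = softmax(test_logits / T)
```

Note that because dividing all logits by the same T preserves their ordering, this transformation leaves the predicted class (and hence accuracy) unchanged and only tempers the reported confidence, which is what makes it a calibration fix rather than a retraining step.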