Application of simultaneous uncertainty quantification for image segmentation with probabilistic deep learning: Performance benchmarking of oropharyngeal cancer target delineation as a use-case

Sahlsten, Jaakko; Jaskari, Joel; Wahid, Kareem A.; Ahmed, Sara; Glerean, Enrico; He, Renjie; Kann, Benjamin H.; Mäkitie, Antti; Fuller, Clifton D.; Naser, Mohamed A.; Kaski, Kimmo

doi:10.1101/2023.02.20.23286188

Cited by 5 publications

(7 citation statements)

References 44 publications


“…Secondly, this study employed only one metric to quantify uncertainty. The choice was made because this metric aligns well with our threshold-based approach and was shown to be the preferred choice for GTVp segmentation tasks by Sahlsten et al [5]. Nevertheless, a broader exploration of patient-specific characteristics is required to gain a comprehensive understanding of their effects on DL segmentation model results.…”

Section: Discussion

mentioning

confidence: 98%


Uncertainty-Aware Deep Learning for Segmentation of Primary Tumour and Pathologic Lymph Nodes in Oropharyngeal Cancer: Insights from a Multi-Centre Cohort

De Biase, Sijtsema, Dijk et al. 2024

Preprint


Purpose: Within the medical field, there is a growing demand for deep learning (DL) models which convey model certainty to the end-user, while maintaining alignment between model accuracy and certainty. For oropharyngeal cancer (OPC) primary tumour (PT) segmentation in PET/CT images, an ensemble-based DL model was developed which outputs tumour probability maps (TPM) showing voxel-level predicted probabilities. This study extended the network to generate TPMs for both PT and pathologic lymph nodes (PL) and explored whether quantified uncertainty in TPMs can predict segmentation model accuracy in an independent external cohort.

Methods: We gathered PET/CT images and manual delineations of gross tumour volume of the PT (GTVp) and PL (GTVln) of 405 OPC patients treated with (chemo)radiation in our institute in 2010-2022. The publicly available 2022 HECKTOR challenge dataset served as external test set. The existing network was adapted to perform multi-label segmentation, training 15 models and considering the ensemble average of predicted TPMs per patient. Surface and aggregate DSC were computed for the predicted contours at different probability thresholds. Uncertainty was quantified by coefficient of variation (CV) of multiple predictions.

Results: GTVln segmentation showed slightly lower performance than GTVp: aggregate DSC of 0.66 and 0.67 in the internal, 0.70 and 0.75 in the external test sets. However, a significant negative correlation (about -0.6) was observed in both test sets between Average Surface DSC and CV for both structures, indicating a significant calibration.

Conclusion: Significant accuracy versus uncertainty calibration was achieved for TPMs in both internal and external test sets, indicating the potential use of CV to identify cases with lower GTVp and GTVln segmentation accuracy, independently of the dataset.
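The ensemble-averaging and thresholding steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the flat-list voxel representation, the toy probability values, and the thresholds are all assumptions made for illustration, and a plain (volumetric) Dice score stands in for the surface and aggregate DSC variants named in the abstract.

```python
def ensemble_mean(maps):
    """Average voxel-wise probabilities over ensemble members."""
    n = len(maps)
    return [sum(vals) / n for vals in zip(*maps)]


def threshold(tpm, t):
    """Binarise a tumour probability map at probability threshold t."""
    return [1 if p >= t else 0 for p in tpm]


def dice(pred, ref):
    """Plain Dice similarity coefficient between two binary masks."""
    inter = sum(p & r for p, r in zip(pred, ref))
    denom = sum(pred) + sum(ref)
    # Assumption: two empty masks count as perfect agreement.
    return 2 * inter / denom if denom else 1.0


# Hypothetical probability maps from three ensemble members (six voxels each).
members = [
    [0.9, 0.8, 0.6, 0.2, 0.1, 0.0],
    [1.0, 0.7, 0.3, 0.4, 0.0, 0.0],
    [0.8, 0.9, 0.6, 0.3, 0.2, 0.0],
]
reference = [1, 1, 1, 0, 0, 0]  # hypothetical manual delineation

tpm = ensemble_mean(members)
scores = {t: dice(threshold(tpm, t), reference) for t in (0.25, 0.4, 0.7)}
for t, score in scores.items():
    print(f"threshold={t}: DSC={score:.3f}")
```

In this toy case a low threshold over-segments and a high threshold under-segments, which is why the score is reported at several probability thresholds rather than a single one.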


“…To quantify uncertainty, Sahlsten et al demonstrated that the coefficient of variation (CV) may be optimal as an uncertainty measure in OPC tumour segmentation [5]. This metric quantifies the variation of the volume across multiple predictions, and it showed to be negatively correlated with the DSC in GTVp segmentation.…”

Section: Uncertainty Quantification

mentioning

confidence: 99%
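As a rough illustration of the metric described in the statement above, the coefficient of variation of the predicted volume across multiple predictions could be computed as follows. This is a sketch under stated assumptions: the function name and the example volumes are hypothetical, and the use of the population standard deviation (rather than the sample one) is a choice made here, not something the quoted text specifies.

```python
from statistics import mean, pstdev


def volume_cv(volumes):
    """Coefficient of variation (standard deviation / mean) of predicted volumes."""
    mu = mean(volumes)
    if mu == 0:
        # Assumption: if every prediction is empty, report zero variation.
        return 0.0
    return pstdev(volumes) / mu


# Hypothetical predicted tumour volumes (cm^3) from five ensemble members.
consistent = [10.0, 10.5, 9.5, 10.2, 9.8]
inconsistent = [4.0, 12.0, 7.5, 15.0, 2.5]

print(f"consistent case CV:   {volume_cv(consistent):.3f}")
print(f"inconsistent case CV: {volume_cv(inconsistent):.3f}")
```

A higher CV means the individual predictions disagree more about the size of the structure, which is the signal the citing study reports as negatively correlated with DSC.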


“…Importantly, ensembling (e.g., through cross-validation schemes) is becoming increasingly common for many DL solutions [9]. We have previously benchmarked ensembling under a U-net framework for uncertainty estimation in oropharyngeal cancer auto-segmentation and have shown its efficacy [10]. Interestingly, Outeiral et al use cross-validation within their study for robustness analysis; merging their cross-validation outputs into an ensemble could have improved calibration when employing their HiS metric.…”

mentioning

confidence: 99%

“…Finally, we would like to note that the proposed HiS metric, if used to measure uncertainty, may be unable to disentangle epistemic uncertainty (i.e., intrinsic model uncertainty) and aleatoric uncertainty (i.e., extrinsic statistical uncertainty) [12]. While the same can be said of general measures of entropy, there exist alternative entropy-related uncertainty metrics, like expected entropy and mutual information, that could distinguish the source of the uncertainty when combined with an approximate Bayesian approach [10], [13]. Moreover, when the distribution of DL network parameters is assumed to be a delta distribution, e.g., in a conventional DL network, the epistemic uncertainty is implicitly assumed to be non-existent.…”

mentioning

confidence: 99%
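The entropy-based decomposition mentioned in the last statement can be sketched for a single voxel with a binary (tumour versus background) label. The sketch assumes the commonly used convention in which the entropy of the averaged prediction is the total uncertainty, the expected entropy over sampled models is read as the aleatoric part, and their difference (the mutual information) is read as the epistemic part. The function names and probability values are hypothetical, and an ensemble is used here as the approximate Bayesian approach.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in nats) of a Bernoulli distribution with parameter p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def decompose(probs):
    """Return (total entropy, expected entropy, mutual information) for one voxel.

    probs holds the tumour probabilities predicted by each sampled model.
    """
    mean_p = sum(probs) / len(probs)
    total = binary_entropy(mean_p)
    expected = sum(binary_entropy(p) for p in probs) / len(probs)
    return total, expected, total - expected


# Members agree that the voxel is ambiguous: high expected entropy, no mutual information.
agree = decompose([0.5, 0.5, 0.5, 0.5])
# Members are individually confident but contradict each other: high mutual information.
disagree = decompose([0.01, 0.99, 0.01, 0.99])
# A single deterministic network (delta distribution over parameters).
single = decompose([0.7])

for name, (total, expected, mi) in [("agree", agree), ("disagree", disagree), ("single", single)]:
    print(f"{name}: total={total:.3f} expected={expected:.3f} mutual_info={mi:.3f}")
```

The single-model case yields zero mutual information by construction, which mirrors the point in the statement that a conventional network implicitly treats epistemic uncertainty as non-existent.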