Purpose
In medicine, there is growing demand for deep learning (DL) models that convey their certainty to the end user while keeping that certainty aligned with model accuracy. For oropharyngeal cancer (OPC) primary tumour (PT) segmentation in PET/CT images, an ensemble-based DL model was previously developed that outputs tumour probability maps (TPMs) showing voxel-level predicted probabilities. This study extended the network to generate TPMs for both the PT and pathologic lymph nodes (PL), and explored whether quantified uncertainty in the TPMs can predict segmentation accuracy in an independent external cohort.
Methods
We collected PET/CT images and manual delineations of the gross tumour volume of the PT (GTVp) and PL (GTVln) for 405 OPC patients treated with (chemo)radiation at our institute between 2010 and 2022. The publicly available 2022 HECKTOR challenge dataset served as the external test set. The existing network was adapted to perform multi-label segmentation: 15 models were trained, and the ensemble average of their predicted TPMs was taken per patient. Surface and aggregate Dice similarity coefficients (DSC) were computed for the predicted contours at different probability thresholds. Uncertainty was quantified by the coefficient of variation (CV) of multiple predictions.
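The ensemble averaging, thresholding, aggregate DSC and CV steps can be illustrated with a minimal NumPy sketch. It is not the study's implementation: the function names, the 0.5 default threshold, and the choice to compute CV over the per-model predicted volumes are assumptions made for illustration, since the abstract does not specify which predictions enter the CV.

```python
import numpy as np

def ensemble_tpm(member_probs):
    """Average per-model probability maps: (n_models, *volume) -> (*volume)."""
    return np.mean(np.asarray(member_probs, dtype=float), axis=0)

def contour_from_tpm(tpm, threshold=0.5):
    """Binary contour at a probability threshold (0.5 is an assumed default)."""
    return np.asarray(tpm) >= threshold

def aggregate_dsc(preds, refs):
    """Aggregate DSC: pool overlaps and volumes over all patients, then divide."""
    inter = sum(int(np.logical_and(p, r).sum()) for p, r in zip(preds, refs))
    total = sum(int(p.sum()) + int(r.sum()) for p, r in zip(preds, refs))
    return 2.0 * inter / total if total > 0 else float("nan")

def coefficient_of_variation(values, eps=1e-8):
    """CV = standard deviation / mean; eps guards against a zero mean."""
    values = np.asarray(values, dtype=float)
    return float(np.std(values) / (np.mean(values) + eps))

def volume_cv(member_probs, threshold=0.5):
    """Assumed uncertainty score: CV of the per-model predicted volumes."""
    member_probs = np.asarray(member_probs, dtype=float)
    masks = member_probs >= threshold
    volumes = masks.reshape(masks.shape[0], -1).sum(axis=1)
    return coefficient_of_variation(volumes)
```

For one patient, `member_probs` would hold the 15 per-model probability maps for a single structure (GTVp or GTVln); `ensemble_tpm` gives the TPM, and `volume_cv` gives one candidate per-patient uncertainty score.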
Results
GTVln segmentation performed slightly worse than GTVp segmentation: aggregate DSC was 0.66 versus 0.67 in the internal test set and 0.70 versus 0.75 in the external test set. In both test sets and for both structures, a significant negative correlation (about -0.6) was observed between average surface DSC and CV, indicating that higher uncertainty coincided with lower segmentation accuracy.
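The relationship between per-patient accuracy and uncertainty can be checked with a short sketch like the one below. The abstract does not state which correlation coefficient was used, so Pearson's r (via `np.corrcoef`) is an assumption here, and both the helper names and the `cutoff` parameter are hypothetical.

```python
import numpy as np

def accuracy_uncertainty_correlation(surface_dsc, cv):
    """Pearson correlation between per-patient surface DSC and CV (assumed metric)."""
    surface_dsc = np.asarray(surface_dsc, dtype=float)
    cv = np.asarray(cv, dtype=float)
    return float(np.corrcoef(surface_dsc, cv)[0, 1])

def flag_uncertain_cases(cv, cutoff):
    """Indices of patients whose CV exceeds a user-chosen (hypothetical) cutoff."""
    return np.flatnonzero(np.asarray(cv, dtype=float) > cutoff)
```

A strongly negative value from `accuracy_uncertainty_correlation` means that cases with high CV tend to have low surface DSC, which is what would make a CV cutoff useful for flagging contours that need closer review.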
Conclusion
Accuracy and uncertainty were significantly calibrated for the TPMs in both the internal and external test sets, indicating that CV could potentially be used to identify cases with lower GTVp and GTVln segmentation accuracy, independently of the dataset.