E pluribus unum: prospective acceptability benchmarking from the Contouring Collaborative for Consensus in Radiation Oncology crowdsourced initiative for multiobserver segmentation

Lin, Diana; Wahid, Kareem A.; Nelms, Benjamin E.; He, Renjie; Naser, Mohammed Abdullah; Duke, Simon; Sherer, Michael V.; Christodouleas, John P.; Mohamed, Abdallah S.R.; Cislo, Michael; Murphy, James D.; Fuller, C.D.; Gillespie, Erin F.

doi:10.1117/1.jmi.10.s1.s11903

Cited by 11 publications

(15 citation statements)

References 37 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The current standard of care for OPC RT relies on manually generated segmentation of the primary gross tumor volume (GTVp) by clinical experts as a target structure to deliver RT dose. However, the GTVp in OPC is notorious for being one of the most difficult structures amongst all cancer types to perform accurate segmentation for RT planning due to its exceptionally high interobserver variability 2–4 . Subsequently, GTVp segmentation has been cited as the single largest factor of uncertainty in RT planning 5,6 .…”

Section: Introductionmentioning

confidence: 99%

Application of simultaneous uncertainty quantification for image segmentation with probabilistic deep learning: Performance benchmarking of oropharyngeal cancer target delineation as a use-case

Sahlsten

Jaskari

Wahid³

et al. 2023

Preprint

View full text Add to dashboard Cite

Background: Oropharyngeal cancer (OPC) is a widespread disease, with radiotherapy being a core treatment modality. Manual segmentation of the primary gross tumor volume (GTVp) is currently employed for OPC radiotherapy planning, but is subject to significant interobserver variability. Deep learning (DL) approaches have shown promise in automating GTVp segmentation, but comparative (auto)confidence metrics of these models predictions has not been well-explored. Quantifying instance-specific DL model uncertainty is crucial to improving clinician trust and facilitating broad clinical implementation. Therefore, in this study, probabilistic DL models for GTVp auto-segmentation were developed using large-scale PET/CT datasets, and various uncertainty auto-estimation methods were systematically investigated and benchmarked. Methods: We utilized the publicly available 2021 HECKTOR Challenge training dataset with 224 co-registered PET/CT scans of OPC patients with corresponding GTVp segmentations as a development set. A separate set of 67 co-registered PET/CT scans of OPC patients with corresponding GTVp segmentations was used for external validation. Two approximate Bayesian deep learning methods, the MC Dropout Ensemble and Deep Ensemble, both with five submodels, were evaluated for GTVp segmentation and uncertainty performance. The segmentation performance was evaluated using the volumetric Dice similarity coefficient (DSC), mean surface distance (MSD), and Hausdorff distance at 95% (95HD). The uncertainty was evaluated using four measures from literature: coefficient of variation (CV), structure expected entropy, structure predictive entropy, and structure mutual information, and additionally with our novel Dice-risk measure. The utility of uncertainty information was evaluated with the accuracy of uncertainty-based segmentation performance prediction using the Accuracy vs Uncertainty (AvU) metric, and by examining the linear correlation between uncertainty estimates and DSC. In addition, batch-based and instance-based referral processes were examined, where the patients with high uncertainty were rejected from the set. In the batch referral process, the area under the referral curve with DSC (R-DSC AUC) was used for evaluation, whereas in the instance referral process, the DSC at various uncertainty thresholds were examined. Results: Both models behaved similarly in terms of the segmentation performance and uncertainty estimation. Specifically, the MC Dropout Ensemble had 0.776 DSC, 1.703 mm MSD, and 5.385 mm 95HD. The Deep Ensemble had 0.767 DSC, 1.717 mm MSD, and 5.477 mm 95HD. The uncertainty measure with the highest DSC correlation was structure predictive entropy with correlation coefficients of 0.699 and 0.692 for the MC Dropout Ensemble and the Deep Ensemble, respectively. The highest AvU value was 0.866 for both models. The best performing uncertainty measure for both models was the CV which had R-DSC AUC of 0.783 and 0.782 for the MC Dropout Ensemble and Deep Ensemble, respectively. With referring patients based on uncertainty thresholds from 0.85 validation DSC for all uncertainty measures, on average the DSC improved from the full dataset by 4.7% and 5.0% while referring 21.8% and 22% patients for MC Dropout Ensemble and Deep Ensemble, respectively. Conclusion: We found that many of the investigated methods provide overall similar but distinct utility in terms of predicting segmentation quality and referral performance. These findings are a critical first-step towards more widespread implementation of uncertainty quantification in OPC GTVp segmentation.

show abstract

Section: Introductionmentioning

confidence: 99%

Application of simultaneous uncertainty quantification for image segmentation with probabilistic deep learning: Performance benchmarking of oropharyngeal cancer target delineation as a use-case

Sahlsten

Jaskari

Wahid³

et al. 2023

Preprint

View full text Add to dashboard Cite

show abstract

“…However, stratification of evaluation metrics, as we have performed in our study, allows for ROI-specific thresholds that act as rough measures of clinical acceptability. Notably, our ROI-specific thresholds are derived from "gold standard" measurements provided by recognized experts within particular disease sites, which were established to have significantly improved segmentation consistency compared to non-experts in previous work 3 . When stratified by previously defined expert IOV cutoffs, the ROIs with the lowest percentage of observers that were able to cross cutoffs were often tumor volumes.…”

Section: Discussionmentioning

confidence: 99%

“…The number of experts used to derive the consensus segmentation was variable depending on the ROI; more information can be found in the corresponding data descriptor 7 . Though “experts” were subjectively determined in the original C3RO study, they demonstrated significantly improved interobserver variability compared to non-expert counterparts 3 . Therefore, the expert STAPLE can be considered as a “gold standard” segmentation to be used for comparison purposes.…”

Section: Methodsmentioning

confidence: 99%

“…The contouring collaborative for consensus in radiation oncology (C3RO), a large-scale crowdsourcing challenge for radiotherapy segmentation, demonstrated that non-expert consensus ROI segmentations could quantitatively approximate expert consensus ROI segmentations in a variety of disease sites 3 , thereby motivating the potential use of a large number of lower-quality segmentations in place of a small number of high-quality segmentations for AI auto-segmentation model training. Notably, segmentations were highly variable among the participants of C3RO, pointing to the existence of underlying factors associated with the resultant segmentation quality.…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Determining The Role Of Radiation Oncologist Demographic Factors On Segmentation Quality: Insights From A Crowd-Sourced Challenge Using Bayesian Estimation

Wahid,

Sahin,

Kundu

et al. 2023

Preprint

Self Cite

View full text Add to dashboard Cite

BACKGROUND: Medical image auto-segmentation is poised to revolutionize radiotherapy workflows. The quality of auto-segmentation training data, primarily derived from clinician observers, is of utmost importance. However, the factors influencing the quality of these clinician-derived segmentations have yet to be fully understood or quantified. Therefore, the purpose of this study was to determine the role of common observer demographic variables on quantitative segmentation performance. METHODS: Organ at risk (OAR) and tumor volume segmentations provided by radiation oncologist observers from the Contouring Collaborative for Consensus in Radiation Oncology public dataset were utilized for this study. Segmentations were derived from five separate disease sites comprised of one patient case each: breast, sarcoma, head and neck (H&N), gynecologic (GYN), and gastrointestinal (GI). Segmentation quality was determined on a structure-by-structure basis by comparing the observer segmentations with an expert-derived consensus gold standard primarily using the Dice Similarity Coefficient (DSC); surface DSC was investigated as a secondary metric. Metrics were stratified into binary groups based on previously established structure-specific expert-derived interobserver variability (IOV) cutoffs. Generalized linear mixed-effects models using Markov chain Monte Carlo Bayesian estimation were used to investigate the association between demographic variables and the binarized segmentation quality for each disease site separately. Variables with a highest density interval excluding zero - loosely analogous to frequentist significance - were considered to substantially impact the outcome measure. RESULTS: After filtering by practicing radiation oncologists, 574, 110, 452, 112, and 48 structure observations remained for the breast, sarcoma, H&N, GYN, and GI cases, respectively. The median percentage of observations that crossed the expert DSC IOV cutoff when stratified by structure type was 55% and 31% for OARs and tumor volumes, respectively. Bayesian regression analysis revealed tumor category had a substantial negative impact on binarized DSC for the breast (coefficient mean +- standard deviation: -0.97 +- 0.20), sarcoma (-1.04 +- 0.54), H&N (-1.00 +- 0.24), and GI (-2.95 +- 0.98) cases. There were no clear recurring relationships between segmentation quality and demographic variables across the cases, with most variables demonstrating large standard deviations and wide highest density intervals. CONCLUSION: Our study highlights substantial uncertainty surrounding conventionally presumed factors influencing segmentation quality. Future studies should investigate additional demographic variables, more patients and imaging modalities, and alternative metrics of segmentation acceptability.

show abstract

“…Specific patient cases were selected by C3RO collaborators on the basis of being adequate reflections of routine patients a generalist radiation oncologist may see in a typical workflow (i.e., not overly complex). Further details on the study design for C3RO can be found in Lin & Wahid et al 19 . www.nature.com/scientificdata www.nature.com/scientificdata/ Imaging protocols.…”

Section: Methodsmentioning

confidence: 99%

Large scale crowdsourced radiotherapy segmentations across a variety of cancer anatomic sites

et al. 2023

View full text Add to dashboard Cite

Clinician generated segmentation of tumor and healthy tissue regions of interest (ROIs) on medical images is crucial for radiotherapy. However, interobserver segmentation variability has long been considered a significant detriment to the implementation of high-quality and consistent radiotherapy dose delivery. This has prompted the increasing development of automated segmentation approaches. However, extant segmentation datasets typically only provide segmentations generated by a limited number of annotators with varying, and often unspecified, levels of expertise. In this data descriptor, numerous clinician annotators manually generated segmentations for ROIs on computed tomography images across a variety of cancer sites (breast, sarcoma, head and neck, gynecologic, gastrointestinal; one patient per cancer site) for the Contouring Collaborative for Consensus in Radiation Oncology challenge. In total, over 200 annotators (experts and non-experts) contributed using a standardized annotation platform (ProKnow). Subsequently, we converted Digital Imaging and Communications in Medicine data into Neuroimaging Informatics Technology Initiative format with standardized nomenclature for ease of use. In addition, we generated consensus segmentations for experts and non-experts using the Simultaneous Truth and Performance Level Estimation method. These standardized, structured, and easily accessible data are a valuable resource for systematically studying variability in segmentation applications.

show abstract

E pluribus unum: prospective acceptability benchmarking from the Contouring Collaborative for Consensus in Radiation Oncology crowdsourced initiative for multiobserver segmentation

Cited by 11 publications

References 37 publications

Application of simultaneous uncertainty quantification for image segmentation with probabilistic deep learning: Performance benchmarking of oropharyngeal cancer target delineation as a use-case

Application of simultaneous uncertainty quantification for image segmentation with probabilistic deep learning: Performance benchmarking of oropharyngeal cancer target delineation as a use-case

Determining The Role Of Radiation Oncologist Demographic Factors On Segmentation Quality: Insights From A Crowd-Sourced Challenge Using Bayesian Estimation

Large scale crowdsourced radiotherapy segmentations across a variety of cancer anatomic sites

Contact Info

Product

Resources

About