The purpose of this work was to characterize expert variation in segmentation of intracranial structures pertinent to radiation therapy, and to assess a registration-driven atlas-based segmentation algorithm in that context. Eight experts were recruited to segment the brainstem, optic chiasm, optic nerves, and eyes, of 20 patients who underwent therapy for large space-occupying tumors. Performance variability was assessed through three geometric measures: volume, Dice similarity coefficient, and Euclidean distance. In addition, two simulated ground truth segmentations were calculated via the simultaneous truth and performance level estimation (STAPLE) algorithm and a novel application of probability maps. The experts and automatic system were found to generate structures of similar volume, though the experts exhibited higher variation with respect to tubular structures. No difference was found between the mean Dice coefficient (DSC) of the automatic and expert delineations as a group at a 5% significance level over all cases and organs. The larger structures of the brainstem and eyes exhibited mean DSC of approximately 0.8–0.9, whereas the tubular chiasm and nerves were lower, approximately 0.4–0.5. Similarly low DSC have been reported previously without the context of several experts and patient volumes. This study, however, provides evidence that experts are similarly challenged. The average maximum distances (maximum inside, maximum outside) from a simulated ground truth ranged from (−4.3, +5.4) mm for the automatic system to (−3.9, +7.5) mm for the experts considered as a group. Over all the structures in a rank of true positive rates at a 2 mm threshold from the simulated ground truth, the automatic system ranked second of the nine raters. This work underscores the need for large scale studies utilizing statistically robust numbers of patients and experts in evaluating quality of automatic algorithms.
Identification of error in non-rigid registration is a critical problem in the medical image processing community. We recently proposed an algorithm that we call “Assessing Quality Using Image Registration Circuits” (AQUIRC) to identify non-rigid registration errors and have tested its performance using simulated cases. In this article, we extend our previous work to assess AQUIRC’s ability to detect local non-rigid registration errors and validate it quantitatively at specific clinical landmarks, namely the Anterior Commissure (AC) and the Posterior Commissure (PC). To test our approach on a representative range of error we utilize 5 different registration methods and use 100 target images and 9 atlas images. Our results show that AQUIRC’s measure of registration quality correlates with the true target registration error (TRE) at these selected landmarks with an R2 = 0.542. To compare our method to a more conventional approach, we compute Local Normalized Correlation Coefficient (LNCC) and show that AQUIRC performs similarly. However, a multi-linear regression performed with both AQUIRC’s measure and LNCC shows a higher correlation with TRE than correlations obtained with either measure alone, thus showing the complementarity of these quality measures. We conclude the article by showing that the AQUIRC algorithm can be used to reduce registration errors for all five algorithms.
Image segmentation has become a vital and often rate limiting step in modern radiotherapy treatment planning. In recent years the pace and scope of algorithm development, and even introduction into the clinic, have far exceeded evaluative studies. In this work we build upon our previous evaluation of a registration driven segmentation algorithm in the context of 8 expert raters and 20 patients who underwent radiotherapy for large space-occupying tumors in the brain. In this work we tested four hypotheses concerning the impact of manual segmentation editing in a randomized single-blinded study. We tested these hypotheses on the normal structures of the brainstem, optic chiasm, eyes and optic nerves using the Dice similarity coefficient, volume, and signed Euclidean distance error to evaluate the impact of editing on inter-rater variance and accuracy. Accuracy analyses relied on two simulated ground truth estimation methods: STAPLE and a novel implementation of probability maps. The experts were presented with automatic, their own, and their peers’ segmentations from our previous study to edit. We found, independent of source, editing reduced inter-rater variance while maintaining or improving accuracy and improving efficiency with at least 60% reduction in contouring time. In areas where raters performed poorly contouring from scratch, editing of the automatic segmentations reduced the prevalence of total anatomical miss from approximately 16% to 8% of the total slices contained within the ground truth estimations. These findings suggest that contour editing could be useful for consensus building such as in developing delineation standards, and that both automated methods and even perhaps less sophisticated atlases could improve efficiency, inter-rater variance, and accuracy.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.