Objective: To identify patterns of inter-expert discrepancy in plus disease diagnosis in retinopathy of prematurity (ROP). Design: We developed two datasets of clinical images of varying disease severity (100 images and 34 images) as part of the Imaging and Informatics in ROP study, and determined a consensus reference standard diagnosis (RSD) for each image, based on 3 independent image graders and the clinical exam. We recruited 8 expert ROP clinicians to classify these images and compared the distribution of classifications between experts and the RSD. Subjects, Participants, and/or Controls: Images obtained during routine ROP screening in neonatal intensive care units; 8 participating experts with >10 years of clinical ROP experience and >5 peer-reviewed ROP publications. Methods, Intervention, or Testing: Expert classification of images of plus disease in ROP. Main Outcome Measures: Inter-expert agreement (weighted kappa statistic), and agreement and bias on ordinal classification between experts (ANOVA) and the RSD (percent agreement). Results: There was variable inter-expert agreement on diagnostic classifications between the 8 experts and the RSD (weighted kappa 0–0.75, mean 0.30). Agreement with the RSD ranged from 80–94% for the dataset of 100 images, and 29–79% for the dataset of 34 images. However, when images were ranked in order of disease severity (by average expert classification), the pattern of expert classification revealed a systematic bias for each expert, consistent with unique cut points for the diagnosis of plus disease and pre-plus disease. The two-way ANOVA model suggested a highly significant effect of both image and user on the average score (P<0.05, adjusted R2=0.82 for dataset A, and P<0.05, adjusted R2=0.6615 for dataset B).
Conclusions and Relevance: There is wide variability in the classification of plus disease by ROP experts, which occurs because experts have different "cut points" for the amount of vascular abnormality required for the presence of plus and pre-plus disease. This has important implications for research, teaching, and patient care for ROP, and suggests that a continuous ROP plus disease severity score may more accurately reflect the behavior of expert ROP clinicians, and may better standardize classification in the future.
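The cut-point explanation can be illustrated with a small sketch. Everything here is hypothetical: the severity values, the expert names, and the thresholds are invented for illustration and are not data from the study. The sketch only shows that two graders who order images identically can still disagree on many categorical labels when their thresholds differ.

```python
def classify(severity, pre_plus_cut, plus_cut):
    """Map a continuous severity value to 0 (normal), 1 (pre-plus), or 2 (plus)."""
    if severity >= plus_cut:
        return 2
    if severity >= pre_plus_cut:
        return 1
    return 0


# Hypothetical continuous severities for ten images, sorted from mild to severe.
severities = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

# Hypothetical experts: (pre-plus cut point, plus cut point). Assumed values.
experts = {
    "conservative": (0.50, 0.80),
    "aggressive": (0.20, 0.50),
}

# Each expert labels every image using only their own cut points.
labels = {
    name: [classify(s, *cuts) for s in severities]
    for name, cuts in experts.items()
}

# Fraction of images on which the two experts assign the same category.
agreement = sum(
    a == b for a, b in zip(labels["conservative"], labels["aggressive"])
) / len(severities)

print(labels["conservative"])
print(labels["aggressive"])
print(f"percent agreement: {agreement:.0%}")
```

Both label sequences rise monotonically with severity, so the experts agree on relative ordering, yet they match on fewer than half of the categorical labels in this toy setup.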
OBJECTIVE: To evaluate the clinical utility of a quantitative deep-learning derived vascular severity score for retinopathy of prematurity (ROP) by assessing its correlation with clinical ROP diagnosis and by measuring clinician agreement in applying a novel scale. DESIGN: Analysis of existing database of posterior pole fundus images and corresponding ophthalmoscopic examinations using two methods of assigning a quantitative scale to vascular severity. SUBJECTS AND PARTICIPANTS: Images were from clinical exams of patients in the Imaging & Informatics in ROP consortium. 4 ophthalmologists and 1 study coordinator evaluated vascular severity on a 1-9 scale. METHODS: A quantitative vascular severity score (1-9) was applied to each image using a deep learning algorithm. A database of 499 images was developed for assessment of interobserver agreement. MAIN OUTCOME MEASURES: Distribution of deep learning derived vascular severity scores with the clinical assessment of zone (I, II, III), stage (0, 1, 2, 3) and extent (<3, 3-6, >6 clock hours) of stage 3 evaluated using multivariable linear regression. Weighted kappa and Pearson correlation coefficients for inter-observer agreement on 1-9 vascular severity scale. RESULTS: For deep learning analysis, a total of 6344 clinical examinations were analyzed. A higher deep learning derived vascular severity score was associated with more posterior disease, higher disease stage, and higher extent of stage 3 disease (P<.001 for all). For a given ROP stage, the vascular severity score was higher in zone I than zone II or III (P<.001). For a given number of clock hours of stage 3, the severity score was higher in zone I than zone II (P=.03 in zone I and P<.001 in zone II). Multivariable regression found zone, stage, and extent were all independently associated with the severity score (P<.001 for all).
For inter-observer agreement, mean (±standard deviation [SD]) weighted kappa was 0.67 (±0.06) and Pearson correlation coefficient (±SD) was 0.88 (±.04) on the use of a 1-9 vascular severity scale. CONCLUSIONS: A vascular severity scale for ROP appears feasible for clinical adoption, corresponds with current international classification of ROP severity, and facilitates the use of objective technology such as deep learning to improve consistency of ROP diagnosis.
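For readers unfamiliar with the weighted kappa statistic, the following is a minimal sketch of Cohen's weighted kappa for two raters on an ordinal scale. The abstract does not state which weighting scheme was used, so the linear default here is an assumption (`power=2` gives quadratic weights), and the function name and example ratings are hypothetical.

```python
def weighted_kappa(rater1, rater2, categories, power=1):
    """Cohen's weighted kappa for two raters on an ordered category list.

    power=1 uses linear disagreement weights, power=2 quadratic weights.
    The result is undefined (division by zero) if both raters use one
    identical category for every item.
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater1)

    # Observed joint proportions of (rater1 category, rater2 category).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater1, rater2):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power  # 0 on the diagonal
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j]

    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores from two graders on a 1-9 vascular severity scale.
scale = list(range(1, 10))
grader_a = [1, 2, 2, 4, 5, 5, 7, 8, 9, 3]
grader_b = [1, 3, 2, 4, 6, 5, 6, 8, 9, 2]
print(round(weighted_kappa(grader_a, grader_b, scale), 3))
```

Weighting penalizes a disagreement of one scale step less than a disagreement of several steps, which is why it suits ordinal scales such as the 1-9 score better than unweighted kappa.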
A tele-education system for ROP education was effective in improving the diagnostic accuracy of ROP by ophthalmologists-in-training in Mexico. This system has the potential to increase competency in ROP diagnosis and management for ophthalmologists-in-training from middle-income nations.
Objective: To determine expert agreement on relative retinopathy of prematurity (ROP) disease severity, whether computer-based image analysis can model relative disease severity, and to propose consideration of a more continuous severity score for ROP. Design: We developed two databases of clinical images of varying disease severity (100 images and 34 images) as part of the i-ROP (Imaging and Informatics in ROP) cohort study and recruited expert physician, non-expert physician, and non-physician graders to classify and perform pairwise comparisons on both databases. Subjects, Participants, and/or Controls: Images obtained during routine ROP screening in neonatal intensive care units. 6 participating expert ROP clinician-scientists, each with a minimum of 10 years of clinical ROP experience and 5 ROP publications. 5 image graders (3 physicians and 2 non-physician graders). Methods: Images in both databases were ranked by average disease classification (classification ranking) and by pairwise comparison using the Elo rating method (comparison ranking), and the rankings were assessed for correlation with the i-ROP computer-based image analysis system. Main Outcome Measures: Inter-expert agreement (weighted kappa statistic) compared with correlation coefficient (CC) between experts on pairwise comparisons, and correlation between expert rankings and computer-based image analysis modeling. Results: There was variable inter-expert agreement on diagnostic classification of disease (plus, pre-plus, or normal) among the 6 experts (mean weighted kappa 0.27, range 0.06–0.63), but good correlation between experts on comparison ranking of disease severity (mean CC 0.84, range 0.74–0.93) on the set of 34 images. Comparison ranking provided a severity ranking that was in good agreement with the ranking obtained by classification (CC 0.92). Comparison ranking on the larger dataset by both expert and non-expert graders demonstrated good correlation (mean CC 0.97, range 0.95–0.98).
The i-ROP system was able to model this continuous severity with good correlation (CC 0.86). Conclusions: Experts diagnose plus disease on a continuum with poor absolute agreement on classification, but good relative agreement on disease severity. These results suggest that the use of pairwise rankings and a continuous severity score, such as that provided by the i-ROP system, may improve agreement on disease severity in the future.
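The comparison ranking described above relies on the Elo rating method. The sketch below uses the standard Elo update rule to turn pairwise "which image is more severe" judgments into a ranking. The abstract does not give the parameters used in the study, so the starting rating of 1500, the K-factor of 32, the function name `elo_rank`, and the example comparisons are all assumptions made for illustration.

```python
def elo_rank(image_ids, comparisons, k_factor=32.0, start=1500.0):
    """Rank images from pairwise judgments with the standard Elo update.

    comparisons is a sequence of (more_severe, less_severe) image ids.
    Returns (ranking from most to least severe, ratings dict).
    """
    ratings = {image_id: start for image_id in image_ids}
    for more_severe, less_severe in comparisons:
        # Expected probability that the image judged more severe "wins",
        # given the ratings before this comparison.
        gap = ratings[less_severe] - ratings[more_severe]
        expected = 1.0 / (1.0 + 10.0 ** (gap / 400.0))
        # The winner scored 1, so it gains k * (1 - expected);
        # the loser gives up the same amount.
        delta = k_factor * (1.0 - expected)
        ratings[more_severe] += delta
        ratings[less_severe] -= delta
    ranking = sorted(image_ids, key=lambda i: ratings[i], reverse=True)
    return ranking, ratings


# Hypothetical judgments: C judged worse than B, B worse than A, C worse than A.
images = ["A", "B", "C"]
judgments = [("C", "B"), ("B", "A"), ("C", "A")]
ranking, ratings = elo_rank(images, judgments)
print(ranking)
```

Elo ratings depend on the order in which comparisons are processed, so a practical analysis would likely shuffle or repeat the comparisons; this sketch processes them once in the order given.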