Rationale and Objectives: Radiographic findings of COVID-19 pneumonia can be used for patient risk stratification; however, radiologist reporting of disease severity on chest radiographs (CXRs) is inconsistent. We aimed to determine whether an artificial intelligence (AI) system could help improve radiologist interrater agreement.
Materials and Methods: We performed a retrospective multi-radiologist user study to evaluate the impact of an AI system, the PXS score model, on the grading of COVID-19 lung disease severity on CXRs into four ordinal categories (normal/minimal, mild, moderate, and severe). Four radiologists (two thoracic and two emergency radiologists) independently interpreted 154 CXRs from 154 unique patients with COVID-19 hospitalized at a large academic center, before and after using the AI system (median washout interval of 16 days). Three different thoracic radiologists assessed the same 154 CXRs using an updated version of the AI system trained on more imaging data. Radiologist interrater agreement was evaluated using Cohen and Fleiss kappa where appropriate. Associations between the lung disease severity categories and clinical outcomes were assessed in a previously published outcomes dataset using Fisher's exact test and the chi-square test for trend.
Results: Use of the AI system improved radiologist interrater agreement (Fleiss κ increased from 0.40 before to 0.66 after use of the system). The Fleiss κ for the three radiologists using the updated AI system was 0.74. Severity categories were significantly associated with subsequent intubation or death within 3 days.
Conclusion: An AI system used at the time of CXR study interpretation can improve the interrater agreement of radiologists.
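
As an illustration of the agreement statistic named in the Methods, the following is a minimal sketch of how Fleiss kappa can be computed for several raters assigning each CXR to one of the four severity categories. It is not the authors' analysis code: the function name `fleiss_kappa` and the toy ratings are hypothetical, and the toy ratings are not study data.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss kappa for a list of subjects, each rated by the same number of raters.

    `ratings` is a list of lists; each inner list holds the category labels
    that the raters assigned to one subject (here, one CXR).
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    category_totals = Counter()  # how often each category was used overall
    agreement_sum = 0.0          # sum of per-subject observed agreement
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Fraction of rater pairs that agree on this subject.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing_pairs / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_subjects
    total_ratings = n_subjects * n_raters
    # Agreement expected by chance from the overall category proportions.
    expected = sum((t / total_ratings) ** 2 for t in category_totals.values())
    if expected == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy ratings (NOT study data): 4 raters, 5 CXRs.
toy_ratings = [
    ["mild", "mild", "moderate", "mild"],
    ["severe", "severe", "severe", "severe"],
    ["normal/minimal", "mild", "normal/minimal", "normal/minimal"],
    ["moderate", "moderate", "severe", "moderate"],
    ["mild", "moderate", "moderate", "moderate"],
]
print(f"Fleiss kappa on toy ratings: {fleiss_kappa(toy_ratings):.2f}")
```

Values near 1 indicate agreement well above chance, values near 0 indicate agreement no better than chance, and negative values indicate agreement below chance. Fleiss kappa treats the categories as unordered, so it does not give partial credit for near misses between adjacent severity grades.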