Importance
Proper evaluation of the performance of artificial intelligence techniques in the analysis of digitized medical images is paramount for the adoption of such techniques by the medical community and regulatory agencies.
Objectives
To compare several cross-validation (CV) approaches for evaluating the performance of a classifier for automatic grading of prostate cancer in digitized histopathologic images, and to compare the classifier's performance when trained using data from 1 expert vs multiple experts.
Design, Setting, and Participants
This quality improvement study used tissue microarray data (333 cores) from 231 patients who underwent radical prostatectomy at the Vancouver General Hospital between June 27, 1997, and June 7, 2011. Digitized images of tissue cores were annotated by 6 pathologists for 4 classes (benign and Gleason grades 3, 4, and 5) between December 12, 2016, and October 5, 2017. Nonoverlapping patches of 192 µm² were extracted from these images. A deep learning classifier based on convolutional neural networks was trained to predict a class label from among the 4 classes (benign and Gleason grades 3, 4, and 5) for each image patch. The classification performance was evaluated with 20-fold CV under 3 schemes: leave-patches-out, leave-cores-out, and leave-patients-out. The analysis was performed between November 15, 2018, and January 1, 2019.
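The distinction among the 3 CV schemes is which unit (patch, core, or patient) is held out as a group, so that correlated samples never span the train/test boundary. A minimal sketch of the leave-patients-out scheme, assuming scikit-learn's `GroupKFold`; the features, labels, and patient IDs below are synthetic placeholders, not study data:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: one row per image patch. Each patch belongs to a patient;
# patches from the same patient must stay in the same fold.
rng = np.random.default_rng(0)
n_patches = 100
X = rng.normal(size=(n_patches, 8))             # placeholder patch features
y = rng.integers(0, 4, size=n_patches)          # 4 classes: benign, Gleason 3/4/5
patients = rng.integers(0, 20, size=n_patches)  # placeholder patient ID per patch

# Leave-patients-out CV: GroupKFold guarantees that no patient ID
# appears in both the training and the test split of any fold.
cv = GroupKFold(n_splits=5)
n_folds = 0
for train_idx, test_idx in cv.split(X, y, groups=patients):
    n_folds += 1
    assert set(patients[train_idx]).isdisjoint(set(patients[test_idx]))
```

Grouping by core instead of patient (leave-cores-out) only changes the `groups` array; plain leave-patches-out corresponds to ungrouped k-fold CV, which is what allows patches from one patient to leak across the train/test split.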
Main Outcomes and Measures
The classifier performance was evaluated by its accuracy, sensitivity, and specificity in detecting cancer (benign vs cancer) and in differentiating low-grade from high-grade cancer (Gleason grade 3 vs grades 4-5). Statistical significance was assessed using the McNemar test. The agreement level between pathologists and the classifier was quantified using a quadratic-weighted κ statistic.
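Both outcome measures have standard closed forms: the quadratic-weighted κ penalizes disagreements by the squared distance between ordinal grades, and the McNemar statistic compares two paired classifiers using only their discordant predictions. A minimal sketch, assuming scikit-learn for κ; the grade sequences and discordant counts below are illustrative, not study results:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal grades (0 = benign, 1 = G3, 2 = G4, 3 = G5)
# assigned to the same patches by a pathologist and by the classifier.
pathologist = [0, 1, 1, 2, 3, 2, 0, 1]
classifier  = [0, 1, 2, 2, 3, 1, 0, 1]

# Quadratic weights: a benign-vs-G5 disagreement costs more than G3-vs-G4.
kappa = cohen_kappa_score(pathologist, classifier, weights="quadratic")

# McNemar test on paired correctness of two CV schemes over the same patches:
# b = patches scheme A got right and B got wrong; c = the reverse.
b, c = 40, 12  # illustrative discordant counts
chi2 = (abs(b - c) - 1) ** 2 / (b + c)  # chi-square with continuity correction
```

Because the McNemar test conditions on the discordant pairs only, it is well suited to comparing two classifiers evaluated on the identical set of patches, as done here across CV schemes.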
Results
On 333 tissue microarray cores from 231 participants with prostate cancer (mean [SD] age, 63.2 [6.3] years), 20-fold leave-patches-out CV resulted in mean (SD) accuracy of 97.8% (1.2%), sensitivity of 98.5% (1.0%), and specificity of 97.5% (1.2%) for classifying benign patches vs cancerous patches. By contrast, 20-fold leave-patients-out CV resulted in mean (SD) accuracy of 85.8% (4.3%), sensitivity of 86.3% (4.1%), and specificity of 85.5% (7.2%). Similarly, 20-fold leave-cores-out CV resulted in mean (SD) accuracy of 86.7% (3.7%), sensitivity of 87.2% (4.0%), and specificity of 87.7% (5.5%). Results of McNemar tests showed that the leave-patches-out CV accuracy, sensitivity, and specificity were significantly higher than those for both leave-patients-out CV and leave-cores-out CV. Similar results were observed for classifying low-grade cancer vs high-grade cancer. When trained on a single expert, the overall agreement in grading between pathologists and the classifier ranged from 0.38 to 0.58; when trained using the majority vote among all experts, it ...