In the setting of a challenge competition, some deep learning algorithms achieved better diagnostic performance than a panel of 11 pathologists participating in a simulation exercise designed to mimic routine pathology workflow; algorithm performance was comparable with an expert pathologist interpreting whole-slide images without time constraints. Whether this approach has clinical utility will require evaluation in a clinical setting.
medRxiv preprint
This article discusses the effect of segregating histopathology image data into three sets: a training set for training the machine learning model, a validation set for model selection, and a test set for testing model performance. We found that one must be cautious when segregating histological image data (slides) into training, validation and test sets, because subtle mishandling of the data can introduce data leakage and give illusorily good results on the test set. We performed this study on gene mutation prediction using the deep neural network from the paper of Coudray et al. [1]. Using the provided code and the same set of data, we discovered that the data segregation method of that paper suffered from a data leakage problem [2]. The paper pools all the slides from all patients and then segregates them exclusively into training, validation and test sets, so that no slide is used in more than one set. This seems to be a clean separation of the data. However, the paper did not consider that some slides are strongly correlated. For example, if the tumor of a patient is cut and stained to produce multiple slides, these slides are strongly correlated. If one slide is used for training and another for testing, the deep neural network can essentially memorize the pattern on the training slide and apply this memory to the test slide. Hence, by memorization, the network can predict very well on the slide in the test set. This mechanism of prediction is not useful in a practical clinical setting, since no two tumors are the same in the real world. In a real setting, we require the deep neural network to generalize across patients and tumors. Hereafter, we call this way of data segregation slide-level segregation. There is a better way to perform data segregation, one that is compatible with the deployment of deep learning models in practical clinical settings.
First, the patients are segregated exclusively into training, validation and test sets. All the slides belonging to the patients in the training set are used solely for training. Similarly, all the slides belonging to the patients in the test set are used for testing only. Segregating the data in this way forces the deep neural network to generalize across patients. We call this way of data segregation patient-level segregation.
In the slide-level segregation analysis, we obtained results similar to those presented in the paper by Coudray et al. [1]: overall performance on the test set was good. However, this performance was illusory due to data leakage. The model gave very good test results on slides that come from a patient who also has slides in the training set. On the other hand, the test results were quite bad on slides that come from a patient who does not have any slides in the training set. Hereafter, we call the slide in the test set as seen-2 .
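The difference between slide-level and patient-level segregation can be illustrated with a minimal sketch. This is not the code of Coudray et al. or of the preprint authors; the function names, the 70/15/15 fractions, and the `slide_to_patient` mapping are all assumptions made for illustration. The sketch splits the same slides both ways and reports which patients end up with slides in more than one set.

```python
import random
from collections import defaultdict


def _cut(items, fractions):
    """Cut an already-shuffled list into train/val/test pieces."""
    n_train = round(fractions[0] * len(items))
    n_val = round(fractions[1] * len(items))
    return {
        "train": items[:n_train],
        "val": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],
    }


def slide_level_split(slide_to_patient, fractions=(0.7, 0.15, 0.15), seed=0):
    """Pool all slides and split them, ignoring which patient they come from."""
    slides = sorted(slide_to_patient)
    random.Random(seed).shuffle(slides)
    return _cut(slides, fractions)


def patient_level_split(slide_to_patient, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split patients first, then send every slide to its patient's set."""
    patients = sorted(set(slide_to_patient.values()))
    random.Random(seed).shuffle(patients)
    patient_sets = _cut(patients, fractions)
    set_of = {p: name for name, group in patient_sets.items() for p in group}
    splits = {"train": [], "val": [], "test": []}
    for slide in sorted(slide_to_patient):
        splits[set_of[slide_to_patient[slide]]].append(slide)
    return splits


def leaked_patients(splits, slide_to_patient):
    """Return patients whose slides appear in more than one set."""
    sets_per_patient = defaultdict(set)
    for name, slides in splits.items():
        for slide in slides:
            sets_per_patient[slide_to_patient[slide]].add(name)
    return {p for p, names in sets_per_patient.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical cohort: 10 patients with 2 slides each.
    mapping = {f"P{p:02d}_S{s}": f"P{p:02d}" for p in range(10) for s in range(2)}
    for split_fn in (slide_level_split, patient_level_split):
        splits = split_fn(mapping)
        leaked = sorted(leaked_patients(splits, mapping))
        print(split_fn.__name__, "patients in more than one set:", leaked)
```

Running a check like `leaked_patients` before training is a cheap way to confirm that no patient contributes slides to more than one set. Libraries that offer group-aware splitters can play the same role as `patient_level_split` when the patient identifier is passed as the group label.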
Tumor purity is the proportion of cancer cells in the tumor tissue. An accurate tumor purity estimate is crucial for accurate pathologic evaluation and for sample selection to minimize normal cell contamination in high-throughput genomic analysis. We developed a novel deep multiple instance learning model that predicts tumor purity from H&E-stained digital histopathology slides. Our model successfully predicted tumor purity from slides of fresh-frozen sections in eight different TCGA cohorts and of formalin-fixed paraffin-embedded sections in a local Singapore cohort. The predictions were highly consistent with genomic tumor purity values, which were inferred from genomic data and are accepted as the gold standard. In addition, we obtained spatially resolved tumor purity maps and showed that tumor purity varies spatially within a sample. Our analyses of the tumor purity maps also suggested that pathologists might have chosen regions of high tumor content inside the slides during tumor purity estimation in the TCGA cohorts, which resulted in values higher than the genomic tumor purity values. In short, our model can be used for high-throughput sample selection for genomic analysis, which will help reduce pathologists’ workload and decrease inter-observer variability. Moreover, spatial tumor purity maps can help to better understand the tumor microenvironment as a key determinant of tumor formation and therapeutic response.