Background: Machine learning (ML) methods are becoming more feasible for use in clinical and epidemiologic research of breast cancer, particularly when characterizing histopathology. Compared with supervised ML methods, unsupervised approaches offer an opportunity to identify previously unrecognized features. The purpose of this study was to use unsupervised deep learning methods to identify histopathological features in diagnostic breast cancer hematoxylin and eosin (H&E) slides that are associated with clinical characteristics and patient outcomes.

Methods: One H&E slide was scanned (Leica Biosystems Aperio Versa scanner) at 20x magnification for each of 1,716 women diagnosed with breast cancer from the Cancer Prevention Study-II Nutrition Cohort. In the pre-processing phase, the scanned images underwent color normalization, artifact detection, and tiling. We then used a VGG16 autoencoder without pretrained weights, with data augmentation, for feature learning and extraction from tiles. These features were clustered in two tiers using the K-means algorithm, and each tile was assigned to the cluster with the highest probability. The tiles were reassembled into whole-slide images. For each slide, the proportion of tiles in each cluster was calculated. We will associate clusters with clinical features and 5- and 10-year breast cancer-specific survival using multivariable logistic and Cox proportional hazards regression models, respectively.

Results: Mean age at baseline enrollment (1992-1993) and at breast cancer diagnosis for the cases was 60.6 years (SD=6.0) and 71.5 years (SD=7.0), respectively. The majority of cancer diagnoses occurred after 1999 (79%), and 81% of the women included were diagnosed with invasive breast cancer. The final pipeline for the full set of images is currently being built. Preliminary runs at the 1x magnification level with 100 cases (N=21,472 tiles) have shown clustering based on macro-level features such as adipose, stromal, and epithelial content. Second-tier clustering (clustering within clusters) shows further delineation of groups within clusters of interest (i.e., epithelial-cell-rich regions). The final output with all 1,716 slides will be based on analysis at the 5x magnification level.

Discussion: We expect that some histopathological features identified by ML models will be associated with conventional pathology features, clinical features, and breast cancer-specific survival. Utilization of ML methods for analyzing histology slides provides additional data that can be integrated into epidemiological studies. Future directions include analyzing images at higher magnifications (10x or 20x), assessing the association between ML histopathological characteristics and breast cancer risk factors, and incorporating these characteristics into prognostic models.

Citation Format: Samantha Puvanesarajah, James M. Hodge, Jacob L. Evans, William Seo, Michelle Yi, Michelle M. Fritz, Mary Macheski-Preston, Ted Gansler, Susan M. Gapstur, Mia M. Gaudet. Unsupervised deep-learning to identify histopathological features among breast cancers in the Cancer Prevention Study-II Nutrition Cohort [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2019; 2019 Mar 29-Apr 3; Atlanta, GA. Philadelphia (PA): AACR; Cancer Res 2019;79(13 Suppl):Abstract nr 2417.
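To make the feature-learning and clustering step concrete, the following is a minimal sketch assuming Keras/TensorFlow and scikit-learn. The tile size, latent dimension, cluster count, and training settings are illustrative placeholders rather than the study's actual configuration, and the data augmentation and second-tier (within-cluster) clustering mentioned in the abstract are omitted for brevity.

```python
import numpy as np
from tensorflow.keras import layers, Model
from sklearn.cluster import KMeans

TILE = 224          # assumed tile edge length in pixels (illustrative, not the study's setting)
LATENT_DIM = 128    # assumed size of the learned feature vector
N_CLUSTERS = 8      # assumed number of first-tier K-means clusters


def vgg_style_autoencoder():
    """VGG16-inspired convolutional autoencoder with randomly initialized weights."""
    inp = layers.Input(shape=(TILE, TILE, 3))
    x = inp
    # Encoder: stacked 3x3 convolutions with max pooling, as in VGG-style blocks.
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    pre_flatten_shape = tuple(int(d) for d in x.shape[1:])
    x = layers.Flatten()(x)
    latent = layers.Dense(LATENT_DIM, name="latent")(x)
    # Decoder: mirror of the encoder, reconstructing the input tile.
    y = layers.Dense(int(np.prod(pre_flatten_shape)), activation="relu")(latent)
    y = layers.Reshape(pre_flatten_shape)(y)
    for filters in (256, 128, 64):
        y = layers.UpSampling2D(2)(y)
        y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    out = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(y)
    autoencoder = Model(inp, out)
    encoder = Model(inp, latent)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder


def cluster_tiles(tiles, slide_ids):
    """Learn tile features by reconstruction, cluster them with K-means,
    and summarize each slide as the proportion of its tiles in each cluster."""
    autoencoder, encoder = vgg_style_autoencoder()
    autoencoder.fit(tiles, tiles, epochs=10, batch_size=32)   # unsupervised reconstruction training
    features = encoder.predict(tiles, batch_size=32)
    labels = KMeans(n_clusters=N_CLUSTERS, random_state=0).fit_predict(features)
    proportions = {}
    for sid in np.unique(slide_ids):
        counts = np.bincount(labels[slide_ids == sid], minlength=N_CLUSTERS)
        proportions[sid] = counts / counts.sum()   # per-slide cluster composition
    return labels, proportions
```

Here `tiles` is assumed to be an array of RGB tiles scaled to [0, 1] and `slide_ids` a parallel array of slide identifiers; the per-slide cluster proportions are the quantities that would feed the logistic and Cox models described above.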
Digital pathology images potentially contain novel patterns that may be perceived by modern deep learning models but not by humans. Prior unsupervised pattern recognition approaches have been used to reveal prognostically relevant subtypes of glioblastoma (PMID: 28984190) and to segment breast density (PMID: 26915120), and may complement supervised machine learning models trained on labeled data. In the Cancer Prevention Study II (CPS-II) cohort (PMID: 12015775), high-resolution, digitized hematoxylin and eosin diagnostic slides are available for approximately 1,700 breast cancer cases, providing an opportunity to perform unsupervised pattern recognition image analysis for epidemiologic breast cancer studies. Given the size of the dataset and the complexity of the models, we constructed an end-to-end analytical pipeline, including preprocessing, feature engineering, and clustering, using cloud-based technologies that enable analysis at scale. Prior to training the unsupervised models, we faced issues converting raw images with open-source software. Specifically, OpenSlide could not open the Leica Versa SCN files due to their proprietary format, while BioFormats inverted colors. To fix these issues, we altered the BioFormats library to successfully convert the files into TIFF format. Since this issue likely affects other researchers, we are in discussions to provide the fix under a public license. TIFF-formatted images were then denoised through color normalization, to reduce hue variance, and artifact detection, to remove unwanted features such as pathologist annotations. Due to the computational complexity of analyzing the full image, images were padded with white space to ensure divisibility and broken into nine tiles of a predefined size. To further reduce computation time, uninformative tiles were filtered out based on a predetermined threshold of artifact and white-space composition. The remaining tiles were input to the unsupervised models. We used convolutional autoencoders, specifically a modified VGG-16 model without pretrained weights, and a deep embedded clustering algorithm. These models learn representations of the images, called ‘feature vectors’, that encode the images’ salient patterns. The final model was chosen based on iterative testing on a subsample of 100 images (N=21,472 tiles) and performance comparison of various VGG-inspired autoencoders. The feature vectors were clustered by K-means to summarize the information in a format suitable for statistical analyses. Our initial results show that the system captures macro-scale tissue patterns at lower magnifications (1x and 5x) and produces clusters that can be integrated into epidemiological studies of breast cancer etiology and prognosis.

Citation Format: Jacob L. Evans, William Seo, Mary Macheski-Preston, Michelle Fritz, Samantha Puvanesarajah, James Hodge, Ted Gansler, Susan Gapstur, Mia M. Gaudet, Michelle Yi. A scalable, cloud-based, unsupervised deep learning system for identification, extraction, and summarization of potentially imperceptible patterns in whole-slide images of breast cancer tissue [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2019; 2019 Mar 29-Apr 3; Atlanta, GA. Philadelphia (PA): AACR; Cancer Res 2019;79(13 Suppl):Abstract nr 1635.
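As an illustration of the padding, tiling, and tile-filtering step described above, the following is a minimal sketch in Python, assuming the slide has already been converted to TIFF and color-normalized. The tile size, background-intensity cutoff, and white-space threshold are hypothetical values chosen for illustration, not the pipeline's actual settings.

```python
import numpy as np
from PIL import Image

TILE = 512                 # assumed tile edge length in pixels
MAX_WHITE_FRACTION = 0.8   # assumed threshold for discarding uninformative tiles
WHITE_CUTOFF = 220         # assumed grayscale intensity treated as background


def pad_to_multiple(img, multiple):
    """Pad an RGB array with white pixels so both dimensions divide evenly into tiles."""
    h, w = img.shape[:2]
    pad_h = (-h) % multiple
    pad_w = (-w) % multiple
    return np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)),
                  mode="constant", constant_values=255)


def tile_and_filter(path):
    """Split a slide image into fixed-size tiles and keep only tissue-rich tiles."""
    img = np.asarray(Image.open(path).convert("RGB"))
    img = pad_to_multiple(img, TILE)
    kept = []
    for r in range(0, img.shape[0], TILE):
        for c in range(0, img.shape[1], TILE):
            tile = img[r:r + TILE, c:c + TILE]
            gray = tile.mean(axis=2)
            white_fraction = float((gray > WHITE_CUTOFF).mean())
            if white_fraction <= MAX_WHITE_FRACTION:   # drop mostly-background tiles
                kept.append(((r, c), tile))
    return kept
```

For full-resolution whole-slide images, tiles would more realistically be read region by region with a slide-reading library rather than loading the entire image into memory, and PIL's default image-size limit would need to be raised; the sketch above only conveys the padding and filtering logic.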