Background: Tumor purity is the percent of cancer cells present in a sample of tumor tissue. The non-cancerous cells (immune cells, fibroblasts, etc.) have an important role in tumor biology. The ability to determine tumor purity is important to understand the roles of cancerous and non-cancerous cells in a tumor.
Methods: We applied a supervised machine learning method, XGBoost, to data from 33 TCGA tumor types to predict tumor purity using RNA-seq gene expression data.
Results: Across the 33 tumor types, the median correlation between observed and predicted tumor purity ranged from 0.75 to 0.87 with small root mean square errors, suggesting that tumor purity can be accurately predicted using expression data. We further confirmed that expression levels of a ten-gene set (CSF2RB, RHOH, C1S, CCDC69, CCL22, CYTIP, POU2AF1, FGR, CCL21, and IL7R) were predictive of tumor purity regardless of tumor type. We tested whether our set of ten genes could accurately predict tumor purity of a TCGA-independent data set. We showed that expression levels from our set of ten genes were highly correlated (ρ = 0.88) with the actual observed tumor purity.
Conclusions: Our analyses suggested that the ten-gene set may serve as a biomarker for tumor purity prediction using gene expression data.
Tumor purity is the percent of cancer cells present in a sample of tumor tissue. The noncancerous cells (stromal cells) in a tumor are thought to have an important role in tumor growth, metastatic progression, and drug resistance. They also strongly influence genomic analyses of tumor samples. The Cancer Genome Atlas (TCGA) has extensive RNA-seq data from tumor tissue samples as well as assessments of tumor purity for the samples. Our goal is to select, for each tumor type, a subset of genes whose expression levels are predictive of tumor purity, as well as a subset of genes whose expression levels are predictive of tumor purity when samples from all tumor types are pooled together. We hope that the genes selected may provide insight about the cell-type composition of tumor samples and about the similarities and differences in tumor microenvironments. We used data from TCGA, which cover 11 different tumor types and include genome-wide gene expression assessments on over 3,148 samples. To identify predictive genes, we used XGBoost, a supervised machine learning algorithm based on the idea of a boosted regression tree ensemble. We carried out 100 repeated runs of 10-fold cross-validation (a total of 1,000 train-test partitions) for each tumor type and also for all tumor types combined. Using the training-set samples, XGBoost selects a set of genes to predict tumor purity levels; the selected genes are subsequently used to predict the purity levels of the test-set samples. Across the 1,000 train-test partitions for all 11 tumor types, the average root-mean-squared error ranged from 0.09 to 0.16 for the test sets. For each tumor type, we selected the top 250 genes based on their aggregated feature importance scores, a measure of each gene's contribution to tumor purity estimation.
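The evaluation scheme described above (100 repeats of 10-fold cross-validation, scored by test-set root-mean-squared error) might be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the helper names (`repeated_kfold`, `cross_validated_rmse`, `fit_mean_model`) and the toy data are assumptions, and `fit_mean_model` is a trivial stand-in where a boosted regression tree model such as XGBoost would be fitted in practice.

```python
import math
import random
from statistics import mean


def repeated_kfold(n_samples, n_splits=10, n_repeats=100, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        folds = [order[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test_idx = folds[k]
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train_idx, test_idx


def rmse(observed, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    return math.sqrt(mean((o - p) ** 2 for o, p in zip(observed, predicted)))


def fit_mean_model(X_train, y_train):
    """Hypothetical stand-in for a boosted-tree fit: always predicts the training mean."""
    mu = mean(y_train)
    return lambda X_test: [mu for _ in X_test]


def cross_validated_rmse(X, y, fit=fit_mean_model, n_splits=10, n_repeats=100, seed=0):
    """Return one test-set RMSE per train-test partition."""
    scores = []
    for train_idx, test_idx in repeated_kfold(len(y), n_splits, n_repeats, seed):
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        y_pred = predict([X[i] for i in test_idx])
        scores.append(rmse([y[i] for i in test_idx], y_pred))
    return scores


# Toy data (made up): 50 samples, 3 expression features, purity values in [0, 1].
toy_rng = random.Random(1)
X_toy = [[toy_rng.random() for _ in range(3)] for _ in range(50)]
y_toy = [toy_rng.random() for _ in range(50)]
scores = cross_validated_rmse(X_toy, y_toy)
print(len(scores), round(mean(scores), 3))  # 1000 partitions, then their mean RMSE
```

With the defaults, the generator produces 100 × 10 = 1,000 train-test partitions, matching the count given in the abstract. Passing a different `fit` function is the intended way to swap in a real regressor.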
No single gene was among the top 250 in all 11 tumor types; however, ACAP1, AMICA1, CSF2RB, CYTIP, GGT5, GLIPR1, IRF4, and PECAM1 were not only among the top 250 in more than 6 tumor types but also in the top 250 when all tumors were combined, suggesting those genes might serve as biomarkers for tumor purity. The most common pathways from gene ontology analysis of these top genes include various immune and signaling pathways. We used XGBoost to identify genes whose expression levels were associated with tumor purity levels in each tumor type. Our results suggest that assessed tumor purity levels in tumor samples can be faithfully recapitulated using certain subsets of genes. We believe that those genes selected for each tumor type by our unbiased approach might provide insight into the biology of the tumor microenvironment, e.g., the presence of cell type-specific marker genes would indicate the presence of specific cell types.
Citation Format: YuanYuan Li, Adrienna Bingham, Qi-Jing Li, Yuan Zhuang, David M. Umbach, Leping Li. Using tumor sample gene expression data to infer tumor purity levels with stochastic gradient boosting machines [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 2255.
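The gene-ranking step in the abstract above (aggregate feature importance over runs, keep the top genes per tumor type, then look for genes that recur across tumor types and in the pooled analysis) could be sketched along these lines. The function names, tumor-type labels, and importance scores are all made up for illustration; only the gene symbols come from the abstract, and summing scores across runs is an assumed aggregation rule, since the abstract does not state how scores were combined.

```python
from collections import Counter, defaultdict


def aggregate_importance(runs):
    """Sum per-gene importance scores over many cross-validation runs."""
    totals = defaultdict(float)
    for run in runs:
        for gene, score in run.items():
            totals[gene] += score
    return dict(totals)


def top_genes(totals, k=250):
    """Return the k genes with the largest aggregated importance."""
    ranked = sorted(totals, key=lambda g: (-totals[g], g))  # ties broken by name
    return set(ranked[:k])


def recurrent_genes(top_by_type, top_pooled, min_types):
    """Genes in the top list of more than `min_types` tumor types and in the pooled list."""
    counts = Counter(g for genes in top_by_type.values() for g in genes)
    return sorted(g for g, n in counts.items() if n > min_types and g in top_pooled)


# Toy importance scores (made up); each dict is one cross-validation run.
runs_by_type = {
    "type_A": [{"CSF2RB": 0.5, "CYTIP": 0.3, "IRF4": 0.1},
               {"CSF2RB": 0.4, "CYTIP": 0.2, "IRF4": 0.3}],
    "type_B": [{"CSF2RB": 0.6, "GGT5": 0.3, "IRF4": 0.1}],
    "type_C": [{"GGT5": 0.5, "CYTIP": 0.4, "IRF4": 0.2}],
}
pooled_runs = [{"CSF2RB": 0.7, "CYTIP": 0.2, "GGT5": 0.1}]

# k=2 and min_types=1 keep the toy small; the abstract uses 250 and 6.
top_by_type = {t: top_genes(aggregate_importance(r), k=2) for t, r in runs_by_type.items()}
top_pooled = top_genes(aggregate_importance(pooled_runs), k=2)
shared = recurrent_genes(top_by_type, top_pooled, min_types=1)
print(shared)  # ['CSF2RB', 'CYTIP']
```

In the toy data GGT5 also appears in two per-type lists, but it is dropped because it is not in the pooled top list, which mirrors the two-part criterion stated in the abstract.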