Objectives: In this study, we aimed to identify putative biomarkers for detecting and characterizing these cells in liver cancer.
Methods: We applied XGBoost, a supervised machine learning method, to gene expression data from 13 GEO data series to classify samples.
Results: Across the 376 samples (129 CSC and 247 non-CSC samples), XGBoost classified the data with high performance. XGBoost feature importance scores and SHAP (SHapley Additive exPlanations) values were used to interpret the results and to assess the importance of individual genes. We confirmed that expression levels of a 10-gene set (PTGER3, AURKB, C15orf40, IDI2, OR8D1, NACA2, SERPINB6, L1CAM, SMC1A, and RASGRF1) were predictive. These 10 genes detected CSCs robustly, with accuracy, sensitivity, and specificity of 97 %, 100 %, and 95 %, respectively.
Conclusions: We suggest that the 10-gene set may serve as a biomarker set for detecting and characterizing CSCs from gene expression data.
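
The accuracy, sensitivity, and specificity reported in the Results follow the usual confusion-matrix definitions for a binary classifier. The following is a minimal sketch of those definitions, assuming CSC samples are labeled 1 and non-CSC samples 0. The function name `classification_metrics` and the toy labels are hypothetical and for illustration only; they are not the study's code or data.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity for binary labels.

    Assumed labeling (not stated in the source): 1 = CSC, 0 = non-CSC.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # CSC predicted as CSC
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-CSC predicted as non-CSC
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-CSC predicted as CSC
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # CSC predicted as non-CSC

    total = tp + tn + fp + fn
    return {
        # Fraction of all samples classified correctly.
        "accuracy": (tp + tn) / total if total else float("nan"),
        # Fraction of true CSC samples that were detected (true positive rate).
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Fraction of true non-CSC samples correctly rejected (true negative rate).
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy labels invented for illustration; NOT data from the study.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case every CSC sample is detected, so sensitivity is 1.0, while one non-CSC sample is misclassified, which lowers specificity and accuracy. A sensitivity of 100 % paired with a lower specificity, as reported above, corresponds to the same pattern: no missed CSC samples and some false positives.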