Background One of the main difficulties in constructing a classification model arises when some or all of the covariates are categorical variables. Classical methods either assign labels to each level of a categorical variable or rely on summary measures (frequencies and percentages), which can be interpreted as probabilities. Methods We adopted a novel mathematical procedure to construct a classification model from categorical variables based on a non-classical probability approach. More specifically, we codified the variables following the categorical data representation from Discriminant Correspondence Analysis before constructing a non-classical probability matrix system that represents an entangled system of dependent-independent variables. We then developed a disentanglement procedure to obtain an empirical density function for each representative class (minimum of two classes). Finally, we constructed our classification model using the density functions. Results We applied the proposed procedure to build a classification model of the malignancy of Solitary Pulmonary Nodule (SPN) after five years of follow-up using routine clinical data. First, we constructed the classification model with 2/3 (270) of the sample of 404 patients with SPN, and then validated it with the remaining 1/3 (134). We tested the procedure's stability by randomly repeating the analysis 1000 times. We obtained a model accuracy of 0.74, an F1 score of 0.58, a Cohen's Kappa value of 0.41 and a Matthews Correlation Coefficient of 0.45. Finally, the area under the ROC curve was 0.86. Conclusion The proposed procedure provides a machine learning classification model of solitary pulmonary nodule malignancy, constructed from routine clinical data composed mainly of categorical variables. Its performance is acceptable, and it could be used by clinicians as a tool to classify SPN malignancy in routine clinical practice.
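The performance measures reported above (accuracy, F1 score, Cohen's Kappa and Matthews Correlation Coefficient) can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch of those standard formulas, not the authors' code; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, F1, Cohen's kappa and MCC from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives for the positive class (e.g. "malignant").
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n

    # F1 is the harmonic mean of precision and recall.
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((tn + fn) / n) * ((tn + fp) / n)
    p_e = p_yes + p_no
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    # Matthews correlation coefficient.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    return {"accuracy": accuracy, "f1": f1, "kappa": kappa, "mcc": mcc}


if __name__ == "__main__":
    # Hypothetical counts for a validation set of 134 patients.
    print(binary_metrics(tp=30, fp=20, fn=15, tn=69))
```

Kappa and MCC both correct for class imbalance, which is why they can sit well below accuracy when one class dominates the sample.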
Progression analysis of disease (PAD) is a methodology that incorporates the output of Disease Specific Genomic Analyses (DSGA) into an unsupervised classification scheme based on Topological Data Analysis (TDA). PAD makes use of data derived from healthy individuals to split individual diseased samples into healthy and disease components. Then, the shape characteristics of the disease component are extracted through the generation of a combinatorial graph by means of the Mapper algorithm. In this paper we introduce a new filtering function for the Mapper algorithm that naturally integrates information on genes linked to disease-free or overall survival. We propose a new PAD-extended methodology termed Progression Analysis of Disease with Survival (PAD-S) and implement it in an R package called SurvMap, which allows users to carry out all the steps involved in PAD-S, as well as in traditional PAD analyses. We tested the PAD-S methodology using SurvMap on a large combined transcriptomics breast cancer dataset, demonstrating its capacity to identify sets of samples displaying highly significant differences in terms of disease-free survival (p = 8 × 10⁻¹⁴) and idiosyncratic biological features. PAD-S and SurvMap were also able to identify sets of samples with significantly different relapse-free survivals and molecular profiles inside breast cancer intrinsic subgroups (luminal A, luminal B, Her2, and basal). Finally, to illustrate that PAD-S and SurvMap are general-purpose analysis tools that can be applied to different types of omics data, we also carried out analyses in a breast cancer methylation dataset derived from The Cancer Genome Atlas (TCGA), identifying groups of patients with significant differences in terms of overall survival and methylation profiles.
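For readers unfamiliar with the Mapper algorithm mentioned above, the following is a minimal, illustrative sketch of its generic steps: cover the range of a filter function with overlapping intervals, cluster the samples that fall in each interval, and connect clusters that share samples. It is not the SurvMap implementation and does not reproduce the survival-based filter function proposed in the paper. The function names, the single-linkage clustering rule and all parameters (`n_intervals`, `overlap`, `eps`) are assumptions made for illustration.

```python
import math
from itertools import combinations


def single_linkage_clusters(indices, points, eps):
    """Group indices into connected components of the 'distance <= eps' graph."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster, stack = {seed}, [seed]
        while stack:
            i = stack.pop()
            near = {j for j in remaining
                    if math.dist(points[i], points[j]) <= eps}
            remaining -= near
            cluster |= near
            stack.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper_graph(points, filter_values, n_intervals=4, overlap=0.5, eps=1.0):
    """Return (nodes, edges) of a Mapper graph for a 1-D filter function.

    nodes: list of frozensets of sample indices (one per local cluster).
    edges: set of node-index pairs whose clusters share at least one sample.
    """
    lo, hi = min(filter_values), max(filter_values)
    # Interval length chosen so n_intervals overlapping intervals span [lo, hi].
    length = (hi - lo) / (n_intervals - overlap * (n_intervals - 1))
    step = length * (1 - overlap)

    nodes = []
    for k in range(n_intervals):
        start = lo + k * step
        # Pin the last interval to hi to avoid floating-point drift.
        end = hi if k == n_intervals - 1 else start + length
        members = [i for i, f in enumerate(filter_values) if start <= f <= end]
        nodes.extend(single_linkage_clusters(members, points, eps))

    edges = {(a, b) for a, b in combinations(range(len(nodes)), 2)
             if nodes[a] & nodes[b]}
    return nodes, edges


if __name__ == "__main__":
    # Toy data: four points on a line, filtered by their coordinate.
    pts = [(0.0,), (1.0,), (2.0,), (3.0,)]
    nodes, edges = mapper_graph(pts, [p[0] for p in pts], n_intervals=2)
    print(nodes, edges)
```

In practice the choice of filter function drives what the resulting graph reveals, which is where a survival-informed filter such as the one described above would enter.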