BackgroundThe wealth of gene expression values being generated by high throughput microarray technologies leads to complex high dimensional datasets. Moreover, many cohorts have the problem of imbalanced classes where the number of patients belonging to each class is not the same. With this kind of dataset, biologists need to identify a small number of informative genes that can be used as biomarkers for a disease.ResultsThis paper introduces a Balanced Iterative Random Forest (BIRF) algorithm to select the most relevant genes for a disease from imbalanced high-throughput gene expression microarray data. Balanced iterative random forest is applied on four cancer microarray datasets: a childhood leukaemia dataset, which represents the main target of this paper, collected from The Children’s Hospital at Westmead, NCI 60, a Colon dataset and a Lung cancer dataset. The results obtained by BIRF are compared to those of Support Vector Machine-Recursive Feature Elimination (SVM-RFE), Multi-class SVM-RFE (MSVM-RFE), Random Forest (RF) and Naive Bayes (NB) classifiers. The results of the BIRF approach outperform these state-of-the-art methods, especially in the case of imbalanced datasets. Experiments on the childhood leukaemia dataset show that a 7% ∼ 12% better accuracy is achieved by BIRF over MSVM-RFE with the ability to predict patients in the minor class. The informative biomarkers selected by the BIRF algorithm were validated by repeating training experiments three times to see whether they are globally informative, or just selected by chance. The results show that 64% of the top genes consistently appear in the three lists, and the top 20 genes remain near the top in the other three lists.ConclusionThe designed BIRF algorithm is an appropriate choice to select genes from imbalanced high-throughput gene expression microarray data. BIRF outperforms the state-of-the-art methods, especially the ability to handle the class-imbalanced data. Moreover, the analysis of the selected genes also provides a way to distinguish between the predictive genes and those that only appear to be predictive.
Electronic Healthcare Records (EHRs) describe the details about a patient's physical and mental health, diagnosis, lab results, treatments or patient care plan and so forth. Currently, the International Classification of Diseases, 10 th Revision or ICD-10 code is used for representing each patient record. The huge amount of information in these records provides insights about the diagnosis and prediction of various diseases. Various data mining techniques are used for the analysis of data deriving from these patient records. Recurrent Neural Network (RNN) is a powerful and widely used technique in machine learning and bioinformatics. This research aims at the investigation of RNN with Long Short-Term Memory (LSTM) hidden units. The empirical research is intended to evaluate the ability of LSTMs to recognize patterns in multi-label classification of cerebrovascular symptoms or stroke. First, we integrated ICD-10 code into health record, as well as other potential risk factors within EHRs into the pattern and model for prediction. Next, we modelled the effectiveness of LSTMs for prediction of stroke based on healthcare records. The results show several strong baselines that include accuracy, recall, and F1 measure score.
The identification of a subset of genes having the ability to capture the necessary information to distinguish classes of patients is crucial in bioinformatics applications. Ensemble and bagging methods have been shown to work effectively in the process of gene selection and classification. Testament to that is random forest which combines random decision trees with bagging to improve overall feature selection and classification accuracy. Surprisingly, the adoption of these methods in support vector machines has only recently received attention but mostly on classification not gene selection. This paper introduces an ensemble SVM-Recursive Feature Elimination (ESVM-RFE) for gene selection that follows the concepts of ensemble and bagging used in random forest but adopts the backward elimination strategy which is the rationale of RFE algorithm. The rationale behind this is, building ensemble SVM models using randomly drawn bootstrap samples from the training set, will produce different feature rankings which will be subsequently aggregated as one feature ranking. As a result, the decision for elimination of features is based upon the ranking of multiple SVM models instead of choosing one particular model. Moreover, this approach will address the problem of imbalanced datasets by constructing a nearly balanced bootstrap sample. Our experiments show that ESVM-RFE for gene selection substantially increased the classification performance on five microarray datasets compared to state-of-the-art methods. Experiments on the childhood leukaemia dataset show that an average 9% better accuracy is achieved by ESVM-RFE over SVM-RFE, and 5% over random forest based approach. The selected genes by the ESVM-RFE algorithm were further explored with Singular Value Decomposition (SVD) which reveals significant clusters with the selected data.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.