Datasets produced in modern research fields, such as biomedical science, pose a number of challenges for machine learning techniques used in binary classification due to their high dimensionality. Feature selection is one of the most important statistical techniques for dimensionality reduction of such datasets, and methods are therefore needed to find an optimal number of features that yields more desirable learning performance. In the machine learning context, gene selection is treated as a feature selection problem, the objective of which is to find a small subset of the most discriminative features for the target class. In this paper, a gene selection method is proposed that identifies the most discriminative genes in two stages. In the first stage, a greedy approach selects genes that unambiguously assign the maximum number of samples to their respective classes. The remaining genes are divided into a certain number of clusters; from each cluster, the most informative genes are selected via the lasso method and combined with the genes selected in the first stage. The performance of the proposed method is assessed through comparison with other state-of-the-art feature selection methods on gene expression datasets. Two classifiers, i.e., random forest and support vector machine, are applied to the datasets with the selected genes and training samples, and their classification accuracy, sensitivity, and Brier score are calculated on the samples in the testing part. Boxplots based on the results and correlation matrices of the selected genes are then constructed. The results show that the proposed method outperforms the other methods.

INDEX TERMS Clustering, classification, feature selection, high dimensional data, microarray gene expression data.
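The two-stage idea above can be sketched in code. This is only an illustrative outline under assumed details: the abstract does not specify the greedy criterion, the clustering method, or how the lasso is applied per cluster, so a single-gene mean-threshold rule, variance-rank clustering, and a correlation-ranking stand-in for the lasso step are used here. All function names (`greedy_stage`, `two_stage_select`) and parameters are hypothetical.

```python
import numpy as np

def greedy_stage(X, y):
    """Stage 1 (sketch): greedily pick genes whose single-gene mean
    threshold correctly classifies the most still-uncovered samples,
    until no candidate gene adds coverage."""
    n, p = X.shape
    remaining = np.ones(n, dtype=bool)   # samples not yet covered
    selected = []
    while remaining.any():
        best_gene, best_hits = None, 0
        for j in range(p):
            if j in selected:
                continue
            pred = (X[:, j] > X[:, j].mean()).astype(int)
            hits = int(np.sum((pred == y) & remaining))
            if hits > best_hits:
                best_gene, best_hits = j, hits
        if best_gene is None:            # no gene covers a new sample
            break
        selected.append(best_gene)
        pred = (X[:, best_gene] > X[:, best_gene].mean()).astype(int)
        remaining &= ~(pred == y)        # drop samples now covered
    return selected

def two_stage_select(X, y, n_clusters=3, per_cluster=1):
    """Stage 2 (sketch): split the remaining genes into clusters (here,
    crudely, by variance rank) and keep the gene(s) most correlated with
    the class label from each cluster -- a stand-in for the lasso step."""
    selected = greedy_stage(X, y)
    rest = np.array([j for j in range(X.shape[1]) if j not in selected])
    order = np.argsort(X[:, rest].var(axis=0))
    for cluster in np.array_split(rest[order], n_clusters):
        scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in cluster]
        top = cluster[np.argsort(scores)[::-1][:per_cluster]]
        selected.extend(int(j) for j in top)
    return selected
```

In a real implementation, the correlation ranking inside each cluster would be replaced by fitting a lasso model on the cluster's genes and keeping those with non-zero coefficients.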
kNN-based ensemble methods minimise the effect of outliers by identifying a set of data points in the given feature space that are nearest to an unseen observation, predicting its response by majority voting. Ordinary kNN-based ensembles find the k nearest observations in a region (bounded by a sphere) for a predefined value of k. This, however, might not work when the test observation follows the pattern of the closest same-class data points that lie along a path not contained in that sphere. This paper proposes a k nearest neighbour ensemble in which the neighbours are determined in k steps. Starting from the first nearest observation to the test point, the algorithm identifies a single observation that is closest to the observation found at the previous step. In each base learner in the ensemble, this search is extended to k steps on a random bootstrap sample with a random subset of features selected from the feature space. The final predicted class of the test point is determined by a majority vote over the predicted classes given by all base models. The new ensemble method is applied to 17 benchmark datasets and compared with other classical methods, including kNN-based models, in terms of classification accuracy, kappa, and Brier score as performance metrics. Boxplots are also used to illustrate the differences between the results given by the proposed and other state-of-the-art methods. The proposed method outperformed the rest of the classical methods in the majority of cases. The paper also gives a detailed simulation study for further assessment.
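The stepwise neighbour search described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the chain walk, bootstrap resampling, random feature subsets, and the two-level majority vote follow the abstract, but all names (`chain_neighbours`, `chain_knn_ensemble`), the feature-subset size, and the number of base models are assumptions.

```python
import numpy as np

def chain_neighbours(X_train, x, k):
    """Find k neighbours by a stepwise walk: first the point nearest to
    the test point x, then at each step the nearest unused point to the
    point found at the previous step."""
    used = np.zeros(len(X_train), dtype=bool)
    current, chain = x, []
    for _ in range(k):
        d = np.linalg.norm(X_train - current, axis=1)
        d[used] = np.inf                 # never revisit a point
        i = int(np.argmin(d))
        used[i] = True
        chain.append(i)
        current = X_train[i]             # continue the walk from here
    return chain

def chain_knn_ensemble(X_train, y_train, x, k=3, n_models=25, seed=0):
    """Ensemble of chain-kNN base learners, each built on a bootstrap
    sample and a random subset of features; final class by majority vote."""
    rng = np.random.default_rng(seed)
    n, p = X_train.shape
    votes = []
    for _ in range(n_models):
        rows = rng.integers(0, n, size=n)                        # bootstrap
        cols = rng.choice(p, size=max(1, p // 2), replace=False) # feature subset
        idx = chain_neighbours(X_train[np.ix_(rows, cols)], x[cols], k)
        labels = y_train[rows][idx]
        votes.append(int(np.bincount(labels).argmax()))  # vote within chain
    return int(np.bincount(votes).argmax())              # vote across models
```

Because each base learner sees a different bootstrap sample and feature subset, the chains explored by the ensemble differ, which is what lets the method follow class patterns along paths rather than inside a fixed sphere.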
In this paper, a new method is proposed for solving the linear regression problem when the number of observations n is smaller than the number of predictors v. The method uses the idea of graphical models and provides unbiased parameter estimates under certain conditions, whereas existing methods such as ridge regression, the least absolute shrinkage and selection operator (LASSO), and least angle regression (LARS) give biased estimates. The new method can also provide a detailed graphical correlation structure for the predictors, so the real causal relationships between the predictors and the response can be identified; in contrast, existing methods often cannot identify the truly important predictors that have possible causal effects on the response variable. Unlike existing methods based on graphical models, the proposed method can identify the potential networks while performing regression even if the data do not follow a multivariate distribution. The new method is compared with existing methods such as ridge regression, LASSO, and LARS using simulated and real datasets. Our experiments reveal that the new method outperforms all the other methods when n < v.