Feature selection is critical for reducing the size of data and improving classifier accuracy by selecting an optimal subset of the overall features. Traditionally, each feature is given a score against a particular category (for example, using Mutual Information), and feature selection comes down to choosing the top k ranked features with the best average score across all categories. However, this approach has two major drawbacks. Firstly, the maximum or average score of a feature with a class does not necessarily determine its discriminating strength among samples of other classes. Secondly, most feature selection methods use only the scores to select discriminating features from the corpus, without taking into account the redundancy of the information provided by the selected features. In this paper, we propose a new feature ranking measure called the Discriminative Mutual Information (DMI) score. This score helps to select features that distinguish samples of one category from those of all other categories. Moreover, we propose a Non-Redundant Feature Selection (NRFS) heuristic that explicitly takes feature redundancy into account when selecting the feature set. The performance of our approach is investigated and compared with other feature selection techniques on datasets derived from high-dimensional text corpora using multiple classification algorithms. The results show that the proposed method achieves a higher classification micro-F1 score than other state-of-the-art methods. In particular, the proposed method shows substantial improvement when the number of selected features is small, as well as greater overall robustness to label noise.
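The traditional baseline described above (score each feature against each category with Mutual Information, average across categories, keep the top k) can be sketched as follows. This is a minimal illustration of that baseline only, not the DMI score or the NRFS heuristic, whose definitions are not given here. The function names, the binary term-presence encoding, the one-vs-rest category indicator, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p(a, b) / (p(a) * p(b)) simplifies to count * n / (px[a] * py[b]).
        mi += p_ab * math.log(count * n / (px[a] * py[b]))
    return mi


def top_k_by_average_mi(X, y, k):
    """Rank features by their MI with each category, averaged over categories.

    X is a list of rows of 0/1 term-presence flags (an assumed encoding),
    y is the list of category labels, and k is the number of features to keep.
    Returns the indices of the k best features and the full score list.
    """
    categories = sorted(set(y))
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        per_category = [
            # One-vs-rest indicator: 1 if the sample belongs to category c.
            mutual_information(column, [int(label == c) for label in y])
            for c in categories
        ]
        scores.append(sum(per_category) / len(per_category))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Hypothetical toy corpus: 4 documents, 3 binary term features.
X = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
]
y = ["sports", "sports", "politics", "politics"]

selected, scores = top_k_by_average_mi(X, y, k=2)
print(selected, scores)
```

In this toy data, feature 1 is the exact complement of feature 0, so the two receive identical scores and both are selected even though the second adds no new information. This is the redundancy problem that score-only ranking leaves unaddressed.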