Heum PARK†a), Nonmember and Hyuk-Chul KWON†b), Member
SUMMARY This paper presents an improved Gini-Index algorithm to correct feature-selection bias in text classification. The Gini-Index has been used as a split measure for choosing the most appropriate splitting attribute in decision trees. Recently, an improved Gini-Index algorithm for feature selection, designed for text categorization and based on Gini-Index theory, was introduced, and it has proved better than other methods. However, we found that the Gini-Index still shows a feature-selection bias in text classification, specifically for unbalanced datasets having a huge number of features. That bias appears in three ways: 1) the Gini values of low-frequency features are low overall (as a purity measure), irrespective of the distribution of features among classes; 2) for high-frequency features, the Gini values are always relatively high; and 3) for specific features belonging to large classes, the Gini values are relatively lower than those of features belonging to small classes. Therefore, to correct that bias and improve feature selection in text classification using the Gini-Index, we propose an improved Gini-Index (I-GI) algorithm with three reformulated Gini-Index expressions. In the present study, we used global dimensionality reduction (DR) and local DR to measure the goodness of features in feature selection. In experimental results for the I-GI algorithm, we obtained unbiased feature values and eliminated many irrelevant general features while retaining many specific features. Furthermore, overall classification performance improved when we used the local DR method. The total averages of classification performance increased by 19.4%, 15.9%, 3.3%, 2.8% and 2.9% (kNN) in Micro-F1; 14%, 9.8%, 9.2%, 3.5% and 4.3% (SVM) in Micro-F1; 20%, 16.9%, 2.8%, 3.6% and 3.1% (kNN) in Macro-F1; and 16.3%, 14%, 7.1%, 4.4% and 6.3% (SVM) in Macro-F1, compared with the tf*idf, χ², Information Gain, Odds Ratio and existing Gini-Index methods for each classifier.
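For readers unfamiliar with Gini-Index feature scoring, the sketch below illustrates the conventional Gini-Index measure often used as a baseline in text categorization, Gini(w) = Σ_i P(w|C_i)² · P(C_i|w)². It is not the proposed I-GI reformulation (the three reformulated expressions are given in the paper itself), and all function and variable names are illustrative assumptions.

```python
import numpy as np

def gini_index_scores(X, y):
    """Conventional Gini-Index feature scores for text categorization:
    Gini(w) = sum_i P(w|C_i)^2 * P(C_i|w)^2, computed from a term-occurrence
    matrix X (documents x features) and class labels y. This is the baseline
    measure the paper refines, NOT the I-GI reformulation itself."""
    X = (np.asarray(X) > 0).astype(float)          # term presence per document
    y = np.asarray(y)
    scores = np.zeros(X.shape[1])
    df_w = X.sum(axis=0)                           # documents containing each term
    for c in np.unique(y):
        Xc = X[y == c]
        df_wc = Xc.sum(axis=0)                     # class-c documents containing each term
        p_w_given_c = df_wc / max(len(Xc), 1)      # P(w | C_i)
        p_c_given_w = np.divide(df_wc, df_w,
                                out=np.zeros_like(df_wc, dtype=float),
                                where=df_w > 0)    # P(C_i | w)
        scores += (p_w_given_c ** 2) * (p_c_given_w ** 2)
    return scores

# Typical use: rank features and keep the top k for classification, e.g.
#   scores = gini_index_scores(X_train, y_train)
#   top_k = np.argsort(scores)[::-1][:2000]
```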
This paper presents extended Relief algorithms and their use in instance-based feature filtering for document feature selection. The Relief algorithms are general and successful feature estimators that detect conditional dependencies of features between instances, and they are applied in the preprocessing step for document classification and regression. Since the introduction of the Relief algorithm, many extended Relief algorithms have been suggested as solutions to the problems of redundant, irrelevant and noisy features, as well as to the Relief algorithm's limitations on two-class and multi-class datasets. In this paper, we introduce additional problems, including the negative influence on computed similarities and weights caused by the small number of features in an instance, the absence of nearest hits or nearest misses for some instances when using Relief algorithms, and other problems. We suggest new extended Relief algorithms to solve those problems; in the course of our research, we experimented on estimating the quality of features from instances, classified the datasets, and compared the results of the new extended Relief algorithms with those of the existing ones. In the experimental results, the new extended Relief algorithms showed better performance than the existing Relief algorithms for all of the datasets.
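As background for the extensions discussed above, the following is a minimal sketch of the standard multiclass Relief-F weight update (k nearest hits, plus k nearest misses from every other class weighted by class priors), assuming numerical features scaled to [0, 1]. It is the baseline estimator the extended algorithms build on, not one of the new algorithms, and all names are illustrative.

```python
import numpy as np

def relieff_weights(X, y, k=5, n_samples=None, rng=None):
    """Standard multiclass Relief-F (baseline, not the extended algorithms):
    for each sampled instance, subtract the mean diff to its k nearest hits
    and add the prior-weighted mean diff to the k nearest misses from every
    other class. Assumes numerical features scaled to [0, 1]."""
    rng = np.random.default_rng(rng)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, d = X.shape
    m = n if n_samples is None else min(n_samples, n)
    classes, counts = np.unique(y, return_counts=True)
    priors = dict(zip(classes, counts / n))
    w = np.zeros(d)
    for idx in rng.choice(n, size=m, replace=False):
        r, cr = X[idx], y[idx]
        dists = np.abs(X - r).sum(axis=1)          # Manhattan distance to all instances
        dists[idx] = np.inf                        # never pick the instance itself
        hit_idx = np.where(y == cr)[0]
        hits = hit_idx[np.argsort(dists[hit_idx])[:k]]        # k nearest same-class
        w -= np.abs(X[hits] - r).mean(axis=0) / m
        for c in classes:
            if c == cr:
                continue
            miss_idx = np.where(y == c)[0]
            misses = miss_idx[np.argsort(dists[miss_idx])[:k]]  # k nearest from class c
            coef = priors[c] / (1.0 - priors[cr])
            w += coef * np.abs(X[misses] - r).mean(axis=0) / m
    return w
```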
SUMMARY This paper presents an extended Relief-F algorithm for nominal attribute estimation, for application to small-document classification. Relief algorithms are general and successful instance-based feature-filtering algorithms for data classification and regression. Many improved Relief algorithms have been introduced as solutions to the problems of redundant, irrelevant and noisy features and to the limitations of the algorithms for multiclass datasets. However, these algorithms have only rarely been applied to text classification, because the numerous features in multiclass datasets lead to great time complexity. Therefore, in considering their application to text feature filtering and classification, we presented an extended Relief-F algorithm for numerical attribute estimation (E-Relief-F) in 2007. However, we found limitations and some problems with it. Therefore, in this paper, we introduce additional problems with Relief algorithms for text feature filtering, including the negative influence on computed similarities and weights caused by the small number of features in an instance, the absence of nearest hits and misses for some instances, and great time complexity. We then suggest a new extended Relief-F algorithm for nominal attribute estimation (E-Relief-Fd) to solve these problems, and we apply it to small text-document classification. In experiments, we used the algorithm to estimate feature quality for various datasets, applied it to classification, and compared its performance with that of existing Relief algorithms. The experimental results show that the new E-Relief-Fd algorithm offers better performance than previous Relief algorithms, including E-Relief-F.
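Since the abstract contrasts numerical attribute estimation (E-Relief-F) with nominal attribute estimation (E-Relief-Fd), the short sketch below shows the conventional Relief-family diff() functions for the two attribute types. The exact diff used by E-Relief-Fd is not given in the abstract, so this is only the standard formulation with illustrative names.

```python
def diff_nominal(a_r, a_h):
    """Relief-family diff() for a nominal attribute: 0 if the two instances
    share the value, 1 otherwise."""
    return 0.0 if a_r == a_h else 1.0

def diff_numerical(a_r, a_h, a_min, a_max):
    """Relief-family diff() for a numerical attribute: absolute difference
    normalised by the attribute's observed range, so diffs stay in [0, 1]."""
    if a_max == a_min:
        return 0.0
    return abs(a_r - a_h) / (a_max - a_min)

# In a Relief-F weight update, these per-attribute diffs are summed over the
# k nearest hits (subtracted) and the prior-weighted k nearest misses (added).
```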