A new metric for feature selection on short text datasets

Çekik, Rasim; Uysal, Alper Kürşat

doi:10.1002/cpe.6909

Cited by 7 publications

(6 citation statements)

References 29 publications


“…The chi‐square test is believed to yield better results when the input variables are categorical or numerical and the output variable is categorical. A chi‐square test [54] is a common and widely used statistical test that reveals whether two variables have a statistically significant relationship (p < 0.0001). We used Chi‐square scores and p values to identify important variables that have a significant impact on the dependent variable (injury severity).…”

Section: Methods (mentioning)

confidence: 99%

Application of rough set theory and machine learning algorithms in predicting accident outcomes in the Indian petroleum industry

Gangadhari

Khanzode

Murthy

2022

Concurrency and Computation


Summary: Recent advancements in machine learning techniques are helping researchers to develop predictive models that give decision‐makers a quick, unbiased overview of processes. However, studies that use machine learning to analyze and classify injury narratives from the petroleum industry are still in their early stages because of data unavailability and a lack of trust in these models. Other industries, such as construction, manufacturing, and aviation, already use the findings of predictive models, but machine learning has not gained comparable importance in the analysis of petroleum industry accident data. This study aims to use available accident data from the Indian petroleum industry to develop a classification model for predicting possible outcomes of an accident. The data comprise 194 accident reports with 20 information attributes collected during the 2016–20 period. Six different machine learning algorithms are used to analyze and classify the possible outcome of an accident. The XGBoost algorithm achieved 95% accuracy, followed by the multilayer perceptron with 94% accuracy. Rough set theory is also used to extract the indiscernibility relationship between the attributes that cause accidents. The results indicate that “skill‐based error, supervisory violation, no personal protective equipment usage, and lack of standard operating procedure compliance” contributed to the majority of the accidents. The findings of this study can assist safety professionals in decision‐making and in mitigating the root causes of contributing factors.
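The citation statement for this entry relies on the chi‐square test of independence between a categorical input and a categorical outcome. As a minimal sketch of how that statistic is computed, the function below takes a contingency table of observed counts and returns the statistic and its degrees of freedom. The function name and the counts are hypothetical and are not taken from the cited study; the sketch also assumes every row and column total is nonzero.

```python
from typing import List, Tuple


def chi_square_statistic(table: List[List[float]]) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    Assumes every row total and column total is nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts (illustration only):
# rows = protective equipment used / not used, columns = minor / severe injury.
observed = [[30, 10], [20, 40]]
stat, dof = chi_square_statistic(observed)
print(f"chi-square = {stat:.3f}, dof = {dof}")
```

Turning the statistic into a p value requires the chi‐square distribution; in practice a library routine such as `scipy.stats.chi2_contingency` returns the statistic, p value, and expected counts in one call.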



“…For example, there are traditional methods such as Information Gain (IG), Gain Ratio (GR), Gini Index (GI), Chi2, Mutual Information (MI) (Sharmin et al, 2019) as well as recently proposed approaches such as DFS, NDM, MMR and MRDC. Many of these methods are widely used in applications such as text classification (Cekik & Uysal, 2022). The IG approach is commonly used, particularly in data and text mining.…”

Section: Related Work (mentioning)

confidence: 99%

A New Feature Selection Metric Based on Rough Sets and Information Gain in Text Classification

ÇEKİK

KAYA

2023

Gazi University Journal of Science Part A: Engineering and Innovation

Self Cite


In text classification, taking words in text documents as features creates a very high dimensional feature space. This is known as the high dimensionality problem in text classification. The most common and effective way to solve this problem is to select an ideal subset of features using a feature selection approach. In this paper, a new feature selection approach called Rough Information Gain (RIG) is presented as a solution to the high dimensionality problem. Rough Information Gain extracts hidden and meaningful patterns in text data with the help of Rough Sets and computes a score value based on these patterns. The proposed approach utilizes the selection strategy of the Information Gain Selection (IG) approach when pattern extraction is completely uncertain. To demonstrate the performance of the Rough Information Gain in the experimental studies, the Micro-F1 success metric is used to compare with Information Gain Selection (IG), Chi-Square (CHI2), Gini Coefficient (GI), Discriminative Feature Selector (DFS) approaches. The proposed Rough Information Gain approach outperforms the other methods in terms of performance, according to the results.
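The abstract above uses Information Gain (IG) as both a baseline and a fallback strategy. The sketch below illustrates the standard IG score of a term, that is, the reduction in class entropy obtained by knowing whether the term is present in a document. It is not the Rough Information Gain method itself, whose details are not given here; the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import List, Set


def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(term: str, docs: List[Set[str]], labels: List[str]) -> float:
    """Class entropy minus the entropy remaining after splitting on term presence."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


# Toy corpus (illustration only): each document is a set of its words.
docs = [{"free", "prize"}, {"free", "win"}, {"meeting", "today"}, {"lunch", "today"}]
labels = ["spam", "spam", "ham", "ham"]

for term in ("free", "prize"):
    print(term, round(information_gain(term, docs, labels), 4))
```

In this toy corpus "free" separates the two classes perfectly and so receives the maximum gain of one bit, while "prize" appears in only one document and scores lower.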


“…2) Comparison with Chi2 Feature Selection Chi2 [79], [80] statistical test has been used in text feature selection based on statistical significance of features. We selected the top "1355 features" using the Chi2 method to compare the results with our best findings.…”

Section: Expid (mentioning)

confidence: 99%

Modified Genetic Algorithm for Feature Selection and Hyper Parameter Optimization: Case of XGBoost in Spam Prediction

2022


Recently, spam on online social networks has attracted attention in the research and business world. Twitter has become the preferred medium to spread spam content. Many research efforts have attempted to counter social network spam. Twitter brings extra challenges in the form of feature space size and imbalanced data distributions. Usually, the related research works focus on only part of these main challenges or produce black-box models. In this paper, we propose a modified genetic algorithm for simultaneous dimensionality reduction and hyper parameter optimization over imbalanced datasets. The algorithm initializes an eXtreme Gradient Boosting classifier and reduces the feature space of a tweets dataset to generate a spam prediction model. The model is validated using a 50 times repeated 10-fold stratified cross-validation, and analyzed using nonparametric statistical tests. The resulting prediction model attains on average 82.32% and 92.67% in terms of geometric mean and accuracy respectively, utilizing less than 10% of the total feature space. The empirical results show that the modified genetic algorithm outperforms Chi2 and PCA feature selection methods. In addition, eXtreme Gradient Boosting outperforms many machine learning algorithms, including a BERT-based deep learning model, in spam prediction. Furthermore, the proposed approach is applied to SMS spam modeling and compared to related works.
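The Chi2 baseline mentioned in this entry ranks features by a chi-square score and keeps the top k. The sketch below shows one common way to do this for text, using the 2×2 term/class contingency form of the statistic and taking the maximum score over classes. The citing paper does not state which variant or library it used, so the scoring rule, function names, and toy data here are assumptions for illustration.

```python
from typing import List, Set


def chi2_term_score(term: str, docs: List[Set[str]], labels: List[str], target: str) -> float:
    """Chi-square score of a term for one class from a 2x2 contingency table."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_class = label == target
        if has_term and in_class:
            a += 1  # term present, target class
        elif has_term:
            b += 1  # term present, other class
        elif in_class:
            c += 1  # term absent, target class
        else:
            d += 1  # term absent, other class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def top_k_terms(docs: List[Set[str]], labels: List[str], k: int) -> List[str]:
    """Rank terms by their maximum per-class score; ties are broken alphabetically."""
    vocab = set().union(*docs)
    classes = set(labels)
    scores = {t: max(chi2_term_score(t, docs, labels, c) for c in classes) for t in vocab}
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (illustration only): each document is a set of its words.
docs = [{"free", "prize"}, {"free", "win"}, {"meeting", "today"}, {"lunch", "today"}]
labels = ["spam", "spam", "ham", "ham"]
print(top_k_terms(docs, labels, 2))
```

For real feature matrices, scikit-learn's `SelectKBest` combined with its `chi2` scoring function covers the same selection step.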

