The impact of preprocessing on text classification

Uysal, Alper Kürşat; Günal, Serkan

doi:10.1016/j.ipm.2013.08.006

Cited by 525 publications

(288 citation statements)

References 25 publications

Mentioning: 257


“…The stop words were removed, as they do not convey any meaningful information. Finally, the textual content of the tweets was converted to lowercase characters, as Uysal and Gunal (2014) showed that lowercase conversion is an effective pre-processing step. As a last step, the text in Hindi was translated to English.…”

Section: Data Pre-processing (mentioning)

confidence: 99%

Event classification and location prediction from tweets during disasters

et al. 2017


Social media is a platform to express one's view in real time. This real time nature of social media makes it an attractive tool for disaster management, as both victims and officials can put their problems and solutions at the same place in real time. We investigate the Twitter post in a flood related disaster and propose an algorithm to identify victims asking for help. The developed system takes tweets as inputs and categorizes them into high or low priority tweets. User location of high priority tweets with no location information is predicted based on historical locations of the users using the Markov model. The system is working well, with its classification accuracy of 81%, and location prediction accuracy of 87%. The present system can be extended for use in other natural disaster situations, such as earthquake, tsunami, etc., as well as man-made disasters such as riots, terrorist attacks etc. The present system is first of its kind, aimed at helping victims during disasters based on their tweets.
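The two preprocessing steps named in the citation statement above, stop-word removal and lowercase conversion, can be illustrated with a minimal sketch. The stop-word list and the `preprocess` helper are hypothetical and chosen only for illustration; neither the citing paper nor Uysal and Günal's study is claimed to use this exact list or tokenizer. Lowercasing is applied first here so that stop-word matching is case-insensitive.

```python
import re

# Illustrative stop-word list (an assumption, not taken from either paper).
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "at", "of", "to", "and"}


def preprocess(text: str) -> list[str]:
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("The river is flooding in the village"))
# prints ['river', 'flooding', 'village']
```

A real pipeline would likely swap in a language-specific stop-word list, since stop words differ from language to language.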


“…For this reason, they are, most of the time, assumed to be uninformative. However, there exists several efforts, which reveals this assumption is not always true [15]. As one can easily realize, stop-words are specific to the language.…”

Section: Preprocessing Methods (mentioning)

confidence: 99%

The Impact of Text Representation and Preprocessing on Author Identification

Pak, Günal 2017

Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering

Self Cite


Author identification, one of the popular topics in text classification and natural language processing, basically aims to determine the author of a given text through various analyses. In the literature, different text representation approaches and use of preprocessing steps are considered for author identification problem. This paper aims to comprehensively examine the impact of text representation and preprocessing steps on author identification specifically for Turkish language. For this purpose, the contributions of all possible combinations of different text representation approaches, namely unigram and bigram, together with the preprocessing tasks, including stemming and stop-word removal, to the performance of author identification are investigated. For the experimental evaluation, a brand new dataset is constituted. Also, two different classification algorithms, namely Multinomial Naive Bayes and Sequential Minimal Optimization, are employed. The results of the experimental analysis reveal that using bigram features alone should be avoided. Besides, it is shown that stop-words should be kept inside the text while stemming can be preferred depending on the classification algorithm so that higher performance can be achieved for author identification.


“…Refs. [42,43] suggest that feature selection is a very important stage in addition to feature extraction and classification. The selected data are moved to the preprocessing module in order to transform data to suit the learning algorithms, ultimately resulting in quality output.…”

Section: Preprocessing (mentioning)

confidence: 99%

“…After this process, any classifier can implement the text classification process by predicting the label of the document. The research community working in this field is still studying how to improve the performance of text classification by combining various preprocessing [43,46], feature extraction [47], feature selection [42,48], and ensemble methods [49]. The following features are extracted for the proposed model:…”

Section: Feature Extraction and Selection (mentioning)

confidence: 99%

Prognosis Essay Scoring and Article Relevancy Using Multi-Text Features and Machine Learning

Mehmood, Lee, et al. 2017

Symmetry


Abstract: This study develops a model for essay scoring and article relevancy. Essay scoring is a costly process when we consider the time spent by an evaluator. It may lead to inequalities of the effort by various evaluators to apply the same evaluation criteria. Bibliometric research uses the evaluation criteria to find relevancy of articles instead. Researchers mostly face relevancy issues while searching articles. Therefore, they classify the articles manually. However, manual classification is burdensome due to time needed for evaluation. The proposed model performs automatic essay evaluation using multi-text features and ensemble machine learning. The proposed method is implemented in two data sets: a Kaggle short answer data set for essay scoring that includes four ranges of disciplines (Science, Biology, English, and English language Arts), and a bibliometric data set having IoT (Internet of Things) and non-IoT classes. The efficacy of the model is measured against the Tandalla and AutoP approach using Cohen's kappa. The model achieves kappa values of 0.80 and 0.83 for the first and second data sets, respectively. Kappa values show that the proposed model has better performance than those of earlier approaches.
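For readers unfamiliar with the metric named in this abstract, the sketch below computes plain (unweighted) Cohen's kappa for two label sequences. The function name and the toy labels are hypothetical, and the choice of the unweighted form is an assumption: the abstract does not say whether the authors used a weighted variant.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy example (made-up labels): 3 of 4 items agree, chance agreement is 0.5.
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
# prints 0.5
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which is the scale on which the reported 0.80 and 0.83 should be read.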
