Sentiment analysis is an important sub-task of Natural Language Processing that aims to determine the polarity of a review. Most of the work done on sentiment analysis is for the resource-rich languages of the world, but very limited work has been done on resource-poor languages. In this work, we focus on developing a Sentiment Analysis System for Roman Urdu, which is a resource-poor language. To this end, a dataset of 11,000 reviews has been gathered from six different domains. Comprehensive annotation guidelines were defined and the dataset was annotated using the multi-annotator methodology. Using the annotated dataset, state-of-the-art algorithms were used to build a sentiment analysis system. To improve the results of these algorithms, four different studies were carried out based on: word-level features, character level features, and feature union. The best results showed that we could reduce the error rate by 12% from the baseline (80.07%). Also, to see if the improvements are statistically significant, we applied t-test and Confidence Interval on the obtained results and found that the best results of each study are statistically significant from the baseline.
Term weighting is one of the most commonly used approaches, which works by assigning weights to terms, that aims to improve the performance of information retrieval or text categorization tasks. In this paper, we present a novel term weighting technique, called discriminative feature spamming technique (DFST), which identifies distinctive terms, based on a term utility criteria (TUC), and then spams them to increase their discriminative power. The experimental results show that the DFST outperformed a set of time-tested term weighting schemes, from the information retrieval field. All the experiments were performed on the largest ever Roman Urdu (RU) dataset of 11000 reviews, which was collected and annotated for this work. In addition, a custom tokenizer was built, which further improved classification accuracy. A cross-scheme comparison was performed, which showed that the results obtained by using the newly proposed DFST, were statistically significant and better than previous approaches.
INDEX TERMSNatural languages, natural language processing, sentiment analysis, Roman Urdu.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.