<span>Sentiment analysis (SA) is an enduring area for research especially in the field of text analysis. Text pre-processing is an important aspect to perform SA accurately. This paper presents a text processing model for SA, using natural language processing techniques for twitter data. The basic phases for machine learning are text collection, text cleaning, pre-processing, feature extractions in a text and then categorize the data according to the SA techniques. Keeping the focus on twitter data, the data is extracted in domain specific manner. In data cleaning phase, noisy data, missing data, punctuation, tags and emoticons have been considered. For pre-processing, tokenization is performed which is followed by stop word removal (SWR). The proposed article provides an insight of the techniques, that are used for text pre-processing, the impact of their presence on the dataset. The accuracy of classification techniques has been improved after applying text pre-processing and dimensionality has been reduced. The proposed corpus can be utilized in the area of market analysis, customer behaviour, polling analysis, and brand monitoring. The text pre-processing process can serve as the baseline to apply predictive analysis, machine learning and deep learning algorithms which can be extended according to problem definition.</span>
Background: Evaluating the sentiments of tweets, blogs, comments and posts have become a crucial part of many applications. Sentiment analysis of social network data is very helpful for decision-making in application areas like movie reviews, product feedback and impact of the speech of a politician etc. The users often comment in their native languages or in slang languages or more often they use abbreviations and do not even stick to grammatical rules of the language. The bilingual and multilingual community often mixes two or more than two languages in their comments. Unavailability of annotated code-mixed data for native language adds to the difficulty in performing sentiment analysis. Objectives The motive of this article is to present the process of creating annotated corpus for code mixed social media text in Hindi, English and Hinglish which is collected from twitter. Ambiguous meaning words and inconsistent spellings of both the languages have also been included in the study to provide wide spread canvas. Method: This study will provide significant elements that should be considered while developing the annotated corpus of Hindi, Hinglish & English dataset. The annotation is calculated on the basis of polarity of words in three categories as positive, negative and neutral. There are words which have mixed feeling i.e. these words have positive as well as negative sentiments. To consider these words, inner agreement among the polarities has been considered. The words used for sarcasm or slangs have also been taken into account. The study has included ambiguous meaning and inconsistent spelling words of both languages as well. Findings: The proposed work provides a standard annotated corpus for code switched social media text in Hindi-English (Hinglish). The process of developing the corpus and calculating the polarity has been shown. It is found that if one considers the code-mixed text, the accuracy can be enhanced. Application: The proposed corpus can be utilized in the area of market analysis, customer behavior, polling analysis, brand monitoring, etc. The corpus serves as dataset which can further be extended according to the problem definition.
This chapter provides a basic understanding of processes and models needed to investigate the data posted by users on social networking sites like Facebook, Twitter, Instagram, etc. Often the databases of social networking sites are large and can't be handled using traditional methodology for analysis. Moreover, the data is posted in such a random manner that can't be used directly for the analysis purpose; therefore, a considerable preprocessing is needed to use that data and generate important results that can help in decision making for various areas like sentiment analysis, customer feedback, customer reviews for brand and product, prevention management, risk management, etc. Therefore, this chapter is discussing various aspects of text and its structure, various machine learning algorithms and their types, why machine learning is better for text analysis, the process of text analysis with the help of examples, issues associated with text analysis, and major application areas of text analysis.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.