Sentiment analysis is a text mining task that determines the polarity of a given text, i.e., its positiveness or negativeness. Recently, it has received a lot of attention given the interest in opinion mining on micro-blogging platforms. These new forms of textual expression are harder to analyze because of slang and orthographic and grammatical errors, among other factors. Beyond these challenges, a practical sentiment classifier should handle large workloads efficiently. The aim of this research is to identify which text transformations (lemmatization, stemming, entity removal, among others), tokenizers (e.g., word n-grams), and token weighting schemes most affect the accuracy of a classifier (Support Vector Machine) trained on two Spanish corpora. The methodology is to exhaustively analyze all combinations of the text transformations and their respective parameters to find out which characteristics the best-performing classifiers have in common. Furthermore, among the text transformations studied, we introduce a novel approach based on the combination of word-based n-grams and character-based q-grams. The results show that this combination of words and characters produces a classifier that outperforms the traditional word-based combination by 11.17% and 5.62% on the INEGI and TASS'15 datasets, respectively.
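As an illustration of the tokenizer idea in the abstract above, the following is a minimal sketch of how word n-grams and character q-grams might be merged into a single bag of tokens. It is not the authors' implementation: the function names, the `w:`/`c:` prefixes, the lowercasing and whitespace normalization, and the default values of n and q are all assumptions made for this example. The resulting counts would typically be reweighted and passed to a classifier such as an SVM.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of `text` (lowercased, split on whitespace)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_qgrams(text, q):
    """Return the character q-grams of `text`, spaces included."""
    # Collapse runs of whitespace so q-grams can span word boundaries cleanly.
    s = " ".join(text.lower().split())
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def combined_tokens(text, word_ns=(1, 2), char_qs=(3,)):
    """Count word n-grams and character q-grams in one bag of tokens.

    The "w:" and "c:" prefixes (an assumption of this sketch) keep a word
    unigram such as "muy" distinct from the identical character 3-gram.
    """
    tokens = []
    for n in word_ns:
        tokens.extend("w:" + t for t in word_ngrams(text, n))
    for q in char_qs:
        tokens.extend("c:" + t for t in char_qgrams(text, q))
    return Counter(tokens)


bag = combined_tokens("muy buen dia")
```

Character q-grams tend to tolerate misspellings and slang better than whole words, because a single typo changes only a few q-grams rather than the entire token.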
Recently, sentiment analysis has received a lot of attention due to the interest in mining the opinions of social media users. Sentiment analysis consists of determining the polarity of a given text, i.e., its degree of positiveness or negativeness. Traditionally, sentiment analysis algorithms have been tailored to a specific language, given the complexity of handling the many lexical variations and errors introduced by the people generating content. In this contribution, our aim is to provide a simple-to-implement and easy-to-use multilingual framework that can serve as a baseline for sentiment analysis contests and as a starting point for building new sentiment analysis systems. We evaluate our approach in eight different languages, three of which have important international contests, namely SemEval (English), TASS (Spanish), and SENTIPOLC (Italian). Within the competitions, our approach reaches medium to high positions in the rankings, whereas in the remaining languages it outperforms the reported results.
The objective of this text is to describe the three categories that make up the database of the Drug Policy Program at the Center for Teaching and Research in Economics (CIDE-PPD), along with their limitations and main features. We also explain what we believe to be the source of the database we originally received and assess its accuracy by comparing it with public records. We describe the validation and codification processes the database was subjected to, as well as the main biases and limitations it may have. Finally, we offer a preliminary analysis of the type of research that the CIDE-PPD database can support. This analysis is relevant not only to those interested in studying the “war on drugs” in Mexico but also to those studying conflict in other countries involved in illegal drug production and trafficking, as well as countries experiencing conflicts related to organized crime.