Codeswitching is a widely observed phenomenon among bilingual speakers. By combining word vectors enriched with subword information with a linear-chain Conditional Random Field, we develop a supervised machine learning model that identifies languages in English-Spanish codeswitched tweets. Our computational method achieves a tweet-level weighted F1 of 0.83 and a token-level accuracy of 0.949 without using any external resources. The result demonstrates that named entity recognition remains a challenge in codeswitched texts and warrants further work.
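As a rough illustration of the token-level tagging setup described above (not the authors' exact feature set), the sketch below trains a linear-chain CRF with sklearn-crfsuite; character n-gram indicator features stand in for the subword-enriched word vectors, and the label set and toy tweets are assumed for illustration only.

```python
# Hypothetical sketch: token-level language identification with a linear-chain CRF.
# Character n-gram features are a stand-in for subword-enriched word vectors.
import sklearn_crfsuite

def token_features(tokens, i):
    """Features for the i-th token: the token itself, its character n-grams,
    and a little left/right context."""
    tok = tokens[i].lower()
    feats = {"bias": 1.0, "token": tok, "is_upper": tokens[i].isupper()}
    for n in (2, 3):
        for j in range(len(tok) - n + 1):
            feats[f"ngram_{n}_{tok[j:j + n]}"] = 1.0
    feats["prev"] = tokens[i - 1].lower() if i > 0 else "<s>"
    feats["next"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>"
    return feats

def featurize(tweet_tokens):
    return [token_features(tweet_tokens, i) for i in range(len(tweet_tokens))]

# Toy training data; real experiments would use an annotated codeswitching corpus.
train_tweets = [["I", "love", "tacos", "al", "pastor"],
                ["vamos", "a", "la", "party", "tonight"]]
train_labels = [["en", "en", "es", "es", "es"],
                ["es", "es", "es", "en", "en"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100, all_possible_transitions=True)
crf.fit([featurize(t) for t in train_tweets], train_labels)
print(crf.predict([featurize(["me", "gusta", "this", "song"])]))
```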
Pinyin is the most widely used romanization scheme for Mandarin Chinese. We consider the task of language identification in Pinyin-English codeswitched texts, a task that is significant because of its application to codeswitched text input. We create a codeswitched corpus by extracting and automatically labeling data from existing Mandarin-English codeswitched corpora. On language identification, we find that an SVM produces the best result with word-level segmentation, achieving 99.3% F1 on a Weibo dataset, while a linear-chain CRF produces the best result at the letter level, achieving 98.2% F1. We then pass the output of our models to a system that converts Pinyin back to Chinese characters to simulate codeswitched text input. Our method achieves the same level of performance as an oracle system that has perfect knowledge of token-level language identity. This result demonstrates that Pinyin identification is not the bottleneck in developing a Chinese-English codeswitched Input Method Editor, and future work should focus on the Pinyin-to-Chinese character conversion step.
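A minimal sketch of word-level Pinyin-versus-English identification with a linear SVM over character n-gram features follows; the scikit-learn pipeline, label names, and toy tokens are assumptions for illustration, not the paper's exact configuration.

```python
# Hypothetical sketch: word-level Pinyin vs. English identification with a
# linear SVM over character n-gram features (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy token-level training data: each token is labeled "pinyin" or "en".
tokens = ["wo", "xihuan", "zhege", "design", "hen", "cool", "mingtian", "meeting"]
labels = ["pinyin", "pinyin", "pinyin", "en", "pinyin", "en", "pinyin", "en"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),  # character n-grams
    LinearSVC(),
)
clf.fit(tokens, labels)
print(clf.predict(["zhidao", "tomorrow"]))  # predicted labels for unseen tokens
```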
In this paper, we propose a novel probabilistic modeling framework based on Latent Dirichlet Allocation (LDA) that aims to reveal the latent aspects and sentiments of reviews simultaneously. Unlike other topic models, which consider only the words appearing in online reviews, we incorporate Part-of-Speech (POS) tags into our model. Since users may use different types of words to express different meanings, we propose two Tag Sentiment Aspect (TSA) models that integrate syntactic information into the review mining process. We apply the proposed models to two datasets, electronic product reviews and movie reviews, and evaluate the results in terms of sentiment aspect extraction and sentiment polarity classification. Our study shows that the proposed models not only achieve promising results on sentiment classification, but also effectively extract different latent sentiment aspects. The TSA models are fully unsupervised and do not need any manually labeled reviews for training; to incorporate priors, only lists of positive and negative words are required. Moreover, the proposed models are effective across different domains.
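To make the prior-incorporation step concrete, the sketch below shows one common way that positive and negative seed-word lists can be turned into asymmetric Dirichlet priors over sentiment-specific word distributions in an LDA-style model; the exact prior scheme used by the TSA models may differ, and all names here are illustrative.

```python
# Hypothetical sketch: turning seed sentiment-word lists into asymmetric
# Dirichlet priors over sentiment-specific word distributions.
import numpy as np

vocab = ["great", "love", "terrible", "boring", "battery", "screen", "plot"]
positive_seeds = {"great", "love"}
negative_seeds = {"terrible", "boring"}

def sentiment_word_prior(vocab, seeds, base=0.01, boost=1.0):
    """Asymmetric Dirichlet prior: seed words for this sentiment get a larger
    pseudo-count, while all other words keep a small symmetric base value."""
    return np.array([base + (boost if w in seeds else 0.0) for w in vocab])

beta_pos = sentiment_word_prior(vocab, positive_seeds)  # favors "great", "love"
beta_neg = sentiment_word_prior(vocab, negative_seeds)  # favors "terrible", "boring"

# In a collapsed Gibbs sampler, these priors would be added to the sentiment-word
# count matrices when sampling each token's (sentiment, aspect) assignment.
print(beta_pos, beta_neg, sep="\n")
```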