Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d17-1056
Refining Word Embeddings for Sentiment Analysis

Abstract: Word embeddings that can capture semantic and syntactic information from contexts have been extensively used for various natural language processing tasks. However, existing methods for learning context-based word embeddings typically fail to capture sufficient sentiment information. This may result in words with similar vector representations having an opposite sentiment polarity (e.g., good and bad), thus degrading sentiment analysis performance. Therefore, this study proposes a word vector refinement model t…
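To make the refinement idea concrete, below is a minimal sketch (not the authors' implementation) of one plausible rank-weighted update, assuming pretrained vectors and a sentiment lexicon of real-valued valence scores; the function name, the 1/rank neighbour weighting, and the `gamma` blending term are all illustrative assumptions.

```python
import numpy as np

def refine_embeddings(vectors, valence, k=10, gamma=0.1, iters=25):
    """Iteratively nudge each lexicon word's vector toward its top-k
    semantic neighbours, re-ranked by sentiment similarity, so that
    sentimentally consistent words end up closer together.
    (Sketch only; the update rule is an illustrative assumption.)"""
    words = [w for w in valence if w in vectors]
    mat = np.stack([vectors[w] for w in words]).astype(float)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)
    for _ in range(iters):
        sims = mat @ mat.T                     # cosine similarities
        np.fill_diagonal(sims, -np.inf)        # exclude self-matches
        new = mat.copy()
        for i, w in enumerate(words):
            nn = np.argsort(-sims[i])[:k]      # top-k semantic neighbours
            # Re-rank neighbours by closeness in valence score, so the
            # sentimentally nearest neighbour gets the largest weight.
            nn = sorted(nn, key=lambda j: abs(valence[words[j]] - valence[w]))
            w_r = 1.0 / (np.arange(len(nn)) + 1.0)
            target = (w_r[:, None] * mat[nn]).sum(axis=0) / w_r.sum()
            # Blend: stay near the original vector, move toward target.
            new[i] = (gamma * mat[i] + target) / (gamma + 1.0)
        mat = new / np.linalg.norm(new, axis=1, keepdims=True)
    return dict(zip(words, mat))
```

In this sketch, `vectors` could hold any pretrained embeddings (e.g., Word2vec or GloVe) and `valence` any sentiment lexicon mapping words to polarity scores; both inputs and the stopping criterion are left to the user.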

Cited by 148 publications (72 citation statements) | References 18 publications
“…Past studies have tried to incorporate sentiment during the training process of the embedding, [17,18] to concatenate pretrained embeddings with additional linguistic features, [19] and to refine the pretrained embedding. [20] Here we incorporated a one-dimensional polarity vector (Fig 1B). We built the dictionary on the basis of a previous lexicon with known sentiments [21] and manually added the words “plus” and “minus.” These added words do not exist in our medical data set and were later used to validate our out-of-vocabulary predictions.…”
Section: Methods
mentioning confidence: 99%
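As a concrete illustration of the one-dimensional polarity feature described above, here is a hedged sketch; the lexicon entries, the neutral default, and the 300-dimensional base embedding are assumptions for illustration (the cited work builds its dictionary from a prior sentiment lexicon and manually adds “plus” and “minus”).

```python
import numpy as np

# Illustrative lexicon stub; the cited work derives polarities from a
# previously published sentiment lexicon and manually adds the words
# "plus" (+1) and "minus" (-1) to probe out-of-vocabulary behaviour.
POLARITY = {"improved": 1.0, "plus": 1.0,
            "worsening": -1.0, "minus": -1.0}

def add_polarity_dim(word, embedding, lexicon=POLARITY, default=0.0):
    """Append a one-dimensional polarity value to a pretrained word
    vector; words missing from the lexicon get a neutral default."""
    return np.concatenate([embedding, [lexicon.get(word.lower(), default)]])

vec = add_polarity_dim("plus", np.zeros(300))  # assumed 300-d base vector
assert vec.shape == (301,) and vec[-1] == 1.0
```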
“…Compared with GloVe+DeepMoji, GloVe+Emo2Vec achieves the same or better results on 11/14 datasets, giving a 1.0% improvement on average. GloVe+Emo2Vec beats the previous SOTA results on three datasets (SE0714, stress, and tube tablet) and is comparable to SOTA on another four datasets (tube auto, SemEval, SCv1-GEN, and SCv2-GEN):

Dataset     Previous SOTA                       GloVe  GloVe+DeepMoji  GloVe+Emo2Vec
SS-Twitter  bi-LSTM (Felbo et al., 2017): 0.88  0.78   0.81            0.81
SS-Youtube  bi-LSTM (Felbo et al., 2017): 0.93  0.84   0.86            0.87
SS-binary   bi-LSTM (Yu et al., 2017): …        …      …               …

We believe the reason we achieve a much better performance than SOTA on SE0714 is that headlines are usually short and emotional words occur more frequently in them.…”
Section: Results
mentioning confidence: 80%
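For readers unfamiliar with the “GloVe+Emo2Vec” setup, the combination is simple feature concatenation; the sketch below assumes 300-d GloVe and 100-d Emo2Vec lookup tables (the dimensions and averaging scheme are assumptions) and averages each before concatenating into a single sentence representation.

```python
import numpy as np

def concat_features(tokens, glove, emo2vec, d_g=300, d_e=100):
    """Average GloVe and Emo2Vec vectors over a token list and
    concatenate them into one feature vector; tokens missing from a
    table are skipped, and all-unknown sentences fall back to zeros."""
    g = [glove[t] for t in tokens if t in glove] or [np.zeros(d_g)]
    e = [emo2vec[t] for t in tokens if t in emo2vec] or [np.zeros(d_e)]
    return np.concatenate([np.mean(g, axis=0), np.mean(e, axis=0)])
```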
“…There are recent studies that aim to project not only the semantic and syntactic but also the sentiment content of text before building a model [15], [16]. [17] addresses the same problem with a distinct approach that uses an existing word embedding model.…”
Section: Training Word2vec Model Results
mentioning confidence: 99%