BACKGROUND
The COVID-19 pandemic, now ongoing for over three years, has had a profound global impact, with 1.1 million deaths in the US and 6.18 million worldwide attributed to COVID-19. COVID-19 vaccines, introduced in late 2020, have been endorsed by public health authorities worldwide as a vital defense against severe illness and death. Social media platforms, particularly Twitter, offer valuable insight into public opinions and responses regarding COVID-19 vaccination, and many studies have employed machine learning to analyze the resulting large volumes of Twitter data. Sentiment analysis tools such as Vader and TextBlob have been used extensively to gauge sentiment toward vaccines and to create reference datasets for training text classification models. This research examines the reliability of that approach and introduces an alternative, based on few-shot learning, for building more robust sentiment classification models.
OBJECTIVE
The goal is to evaluate the reliability of Vader and TextBlob and to assess their suitability for generating robust gold standard datasets for downstream tasks such as sentiment classification, relative to standard machine learning-based classifiers and advanced pre-trained language models.
METHODS
We applied Vader and TextBlob to three human-labeled Twitter datasets to obtain the sentiment of each tweet (positive, negative, or neutral). We compared the performance of Vader and TextBlob against traditional machine learning-based sentiment classifiers (Random Forest, Logistic Regression, Stochastic Gradient Descent, Multinomial Naive Bayes, and a Deep Neural Network [DNN]) as well as few-shot learning-based classifiers. The few-shot learning models use pattern-exploiting training (PET) with BERT-base, BERT-large, and CT-BERT as the underlying language models. We evaluated performance using F1-score, precision, recall, and AUC.
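To illustrate the first step, the sketch below shows how tweets can be labeled with Vader and TextBlob and mapped to the three sentiment classes. It is a minimal example, not the paper's exact pipeline: the ±0.05 cutoff on Vader's compound score follows the convention recommended by the Vader authors, and the TextBlob thresholds and the example tweet are assumptions for illustration.

```python
# Minimal sketch: rule-based sentiment labeling with Vader and TextBlob.
# Assumes the vaderSentiment and textblob packages are installed.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

analyzer = SentimentIntensityAnalyzer()

def vader_label(text: str) -> str:
    # Vader's compound score lies in [-1, 1]; map it to a 3-class label
    # using the commonly recommended +/-0.05 thresholds (an assumption here).
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

def textblob_label(text: str) -> str:
    # TextBlob's polarity also lies in [-1, 1]; zero is treated as neutral.
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

tweet = "The vaccine rollout has been faster than I expected."  # hypothetical example
print(vader_label(tweet), textblob_label(tweet))
```

Labels produced this way can then be compared against the human gold labels (or used to train downstream classifiers), which is the comparison the evaluation metrics above are computed over.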
RESULTS
The findings revealed that Vader and TextBlob performed poorly compared with the other methods: across the different datasets, all classifiers achieved higher F1-scores, with improvements of 28% and 8% over Vader and TextBlob, respectively. The few-shot learning approach excelled, particularly on smaller datasets, achieving at least a 15% F1-score improvement with just 10% of the labeled data, and it consistently outperformed the other models across various training sample sizes. The best F1-scores, obtained with CT-BERT as the PET model, were 0.71, 0.63, and 0.78 across the three datasets. These results suggest that relying on tools such as Vader and TextBlob to establish gold standards for sentiment classification may yield weak models that perform poorly on new datasets.
CONCLUSIONS
This study highlights the limitations of sentiment analysis tools such as Vader and TextBlob when they are used to generate gold standard datasets: models trained on these labels may not be robust or reliable for sentiment classification. As an alternative, the research proposes a few-shot learning approach that can achieve equivalent or better performance with minimal labeled data. This approach offers a promising path toward improving sentiment classification models in healthcare and other domains where understanding public sentiment is crucial.