Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.148

TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification

Abstract: The experimental landscape in natural language processing for social media is too fragmented. Each year, new shared tasks and datasets are proposed, ranging from classics like sentiment analysis to irony detection or emoji prediction. Therefore, it is unclear what the current state of the art is, as there is no standardized evaluation protocol, nor a strong set of baselines trained on such domain-specific data. In this paper, we propose a new evaluation framework (TWEETEVAL) consisting of seven heterogeneou…

Cited by 352 publications (282 citation statements)
References: 24 publications
“…Additionally, we excluded common English positive words that have been found to be negatively correlated with well-being and happiness at the regional level, namely "love", "good", "LOL", "better", "well" and "like" (Jaidka et al, 2020). We test the robustness of the LIWC analysis by comparing with the results of a machine learning model for emotion categorization (Barbieri et al, 2020) based on a RoBERTa (Liu et al, 2019) supervised classifier trained with the annotated tweets from the SemEval dataset (Mohammad et al, 2018).…”
Section: Methods (mentioning)
confidence: 99%
“…For two example countries (USA, Spain), we checked the robustness of our dictionary-based emotion measures by comparing it with measures based on a machine learning model for emotion categorization. For this, we finetuned the pretrained masked language model from the TweetEval Benchmark (Barbieri et al, 2020) - a model based on RoBERTa-base (Liu et al, 2019) - on English and Spanish tweets from the SemEval dataset (Mohammad et al, 2018). We then correlated the percentage of emotional tweets based on dictionary-based (anxiety, anger, sadness, positive LIWC) and machine-learning based emotion labels (fear, anger, sadness and joy + optimism + love predicted by RoBERTa).…”
Section: Duration Of Changes In Emotional Expression (mentioning)
confidence: 99%
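The correlation step described in the statement above can be illustrated with a minimal sketch. The statement does not say which correlation coefficient was used, so Pearson's r is an assumption here, and the percentages below are invented toy values for illustration only, not data from the citing paper. The function name `pearson` and both variable names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the two root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


# Toy values (assumed): share of tweets per time window flagged as "sad"
# by a dictionary method and by a classifier.
dictionary_pct = [2.0, 2.5, 3.1, 4.0, 3.6]
classifier_pct = [5.1, 5.9, 7.2, 9.0, 8.1]

r = pearson(dictionary_pct, classifier_pct)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 would indicate that the two measures rise and fall together over time, which is the kind of agreement the robustness check looks for.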
“…Comparison with SOTA The SST-2, SE17-4A, SM20-5 tasks have been deployed on GLUE, TweetEval, and Codalab benchmarks respectively, therefore, the state of the art (SOTA) results are available. Current SOTA performances on SST-2 are obtained by (Sun et al, 2019) and (Raffel et al, 2020) (tied), SOTA for SE17-4A is reported by (Barbieri et al, 2020), and SOTA for SM20-5 is reported by (Bai and Zhou, 2020) as shown in Figure 10.…”
Section: All Modules Active (mentioning)
confidence: 87%
“…The tweets are classified as Negative, Neutral, and Positive. The performance for this task is measured by the macro-average of recall scores for positive, negative, and neutral classes and evaluated by the TweetEval benchmark (Barbieri et al, 2020).…”
Section: Sentiment Analysis (mentioning)
confidence: 99%
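The metric named in the statement above, the macro-average of per-class recall, can be sketched as follows. This is an illustrative implementation under stated assumptions, not the benchmark's own evaluation script: the function name `macro_recall`, the label strings, and the toy predictions are all hypothetical.

```python
from collections import Counter


def macro_recall(gold, pred, labels=("negative", "neutral", "positive")):
    """Unweighted mean of per-class recall over the classes present in gold."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    support = Counter(gold)  # number of gold examples per class
    hits = Counter(g for g, p in zip(gold, pred) if g == p)  # correct per class
    recalls = [hits[label] / support[label] for label in labels if support[label] > 0]
    if not recalls:
        raise ValueError("none of the labels occur in gold")
    return sum(recalls) / len(recalls)


# Toy example (assumed labels and predictions).
gold = ["positive", "positive", "neutral", "neutral", "negative", "negative"]
pred = ["positive", "neutral", "neutral", "neutral", "negative", "positive"]

score = macro_recall(gold, pred)
print(f"macro-averaged recall = {score:.4f}")
```

Because each class contributes equally regardless of how many examples it has, this metric does not reward a classifier for favoring the most frequent sentiment class.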
“…TWEETEVAL (Barbieri et al, 2020) is a pretrained RoBERTa base model, further trained with 60M tweets, randomly collected, resulting in a Twitter-domain adapted version. We use a selection of four TWEETEVAL models, each fine tuned for a twitter-specific downstream task: hate speech-, emotion- and irony-detection, and offensive language identification.…”
Section: Bertweet (mentioning)
confidence: 99%