Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-NUT 2021), 2021
DOI: 10.18653/v1/2021.wnut-1.49
BERTweetFR: Domain Adaptation of Pre-Trained Language Models for French Tweets

Abstract: We introduce BERTweetFR, the first large-scale pre-trained language model for French tweets. Our model is initialized using the general-domain French language model CamemBERT (Martin et al., 2020), which follows the base architecture of BERT. Experiments show that BERTweetFR outperforms all previous general-domain French language models on two downstream Twitter NLP tasks: offensiveness identification and named entity recognition. The dataset used in the offensiveness detection task is first created and annotated…
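The abstract sketches the adaptation recipe: start from the general-domain CamemBERT checkpoint and continue masked-language-model pre-training on tweets. Below is a minimal sketch of that idea using the HuggingFace transformers Trainer, assuming a plain-text corpus with one tweet per line; the file name, sequence length, batch size, and epoch count are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch: domain-adaptive pre-training, continuing MLM training
# from the camembert-base checkpoint on a corpus of French tweets.
# Corpus path and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    CamembertForMaskedLM,
    CamembertTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertForMaskedLM.from_pretrained("camembert-base")

# Assumed input format: one tweet per line in a plain-text file.
dataset = load_dataset("text", data_files={"train": "french_tweets.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens: the standard BERT/CamemBERT MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bertweetfr-adapt",
        num_train_epochs=1,
        per_device_train_batch_size=32,
    ),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```

Starting from CamemBERT rather than random weights lets the model keep its general French competence while the continued pre-training shifts it toward the noisy, informal register of tweets.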

Cited by 12 publications (6 citation statements). References: 16 publications.
“…To detect concerns, we fine-tune a BERTweetFR model (Guo et al. 2021) to predict concerns from 10K human-annotated data and train for 3 epochs with a batch size of 8. Each concern becomes a binary label prediction task, allowing for multiple concerns to be found in each tweet.…”
Section: Concern Detection (mentioning)
confidence: 99%
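The statement above describes fine-tuning BERTweetFR as a set of per-concern binary classifiers, so a tweet can carry several concerns at once. A minimal sketch of that setup follows, assuming the Yanzhu/bertweetfr-base checkpoint on the HuggingFace Hub and a placeholder three-concern label set; only the 3-epoch, batch-size-8 hyperparameters come from the quoted paper.

```python
# Hypothetical sketch of multi-label concern detection: each concern is an
# independent binary label. Checkpoint name, label set, and example data
# are assumptions for illustration.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CONCERNS = ["health", "economy", "safety"]  # illustrative label set

class ConcernDataset(Dataset):
    """Pairs of (tweet text, multi-hot concern vector)."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        # Float labels: multi-label classification uses BCE-with-logits loss.
        item["labels"] = torch.tensor(self.labels[i], dtype=torch.float)
        return item

tokenizer = AutoTokenizer.from_pretrained("Yanzhu/bertweetfr-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "Yanzhu/bertweetfr-base",
    num_labels=len(CONCERNS),
    problem_type="multi_label_classification",  # one sigmoid unit per concern
)

train_set = ConcernDataset(
    ["le vaccin m'inquiète", "les prix augmentent encore"],  # toy examples
    [[1, 0, 0], [0, 1, 0]],
    tokenizer,
)

# Hyperparameters from the citing paper's description: 3 epochs, batch size 8.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="concern-model",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_set,
)
trainer.train()
```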
“…The most common among them are CamemBERT [20] and FlauBERT [21]. The first is a more general model, while the latter is better suited to downstream tasks [22]. In addition, CamemBERT outperforms FlauBERT [22], [23]. We thus will use the former as the state-of-the-art approach implementation.…”
Section: Related Work, A. Rule Identification (mentioning)
confidence: 99%
“…As confirmation of this statement, 4 out of 12 tasks in the SemEval 2023 competition were based on Twitter data. Monolingual models have been trained on tweets in languages such as English (Nguyen et al., 2020), Arabic (Antoun et al., 2020; Abdelali et al., 2021), French (Guo et al., 2021), Hebrew (Seker et al., 2022), Indonesian (Koto et al., 2021), Italian (Polignano et al., 2019) and Spanish (González et al., 2021; Pérez et al., 2022). Some of them were initialized with the weights of existing general-domain models and adapted to Twitter data by continued pre-training, while others were trained on Twitter data from scratch.…”
Section: Language Models For Social Media (mentioning)
confidence: 99%
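The last sentence of that statement distinguishes the two pre-training strategies: continued pre-training from general-domain weights (the BERTweetFR approach) versus training on tweets from scratch. The schematic contrast below uses a RoBERTa-style architecture; the checkpoint name and configuration values are illustrative assumptions.

```python
# Schematic contrast of the two strategies. Checkpoint name and
# configuration values are illustrative assumptions.
from transformers import CamembertForMaskedLM, RobertaConfig, RobertaForMaskedLM

# Continued pre-training: reuse general-domain weights, then keep training on tweets.
adapted = CamembertForMaskedLM.from_pretrained("camembert-base")

# From scratch: same architecture family, randomly initialized weights;
# this typically needs a much larger tweet corpus to reach comparable quality.
config = RobertaConfig(vocab_size=32000, hidden_size=768,
                       num_hidden_layers=12, num_attention_heads=12)
scratch = RobertaForMaskedLM(config)
```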