Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-NUT 2021), 2021
DOI: 10.18653/v1/2021.wnut-1.49
BERTweetFR: Domain Adaptation of Pre-Trained Language Models for French Tweets

Abstract: We introduce BERTweetFR, the first large-scale pre-trained language model for French tweets. Our model is initialized using the general-domain French language model CamemBERT (Martin et al., 2020), which follows the base architecture of BERT. Experiments show that BERTweetFR outperforms all previous general-domain French language models on two downstream Twitter NLP tasks: offensiveness identification and named entity recognition. The dataset used in the offensiveness detection task is first created and annotated…
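The abstract sketches the adaptation recipe: start from the general-domain CamemBERT checkpoint and continue masked-language-model pre-training on tweets. Below is a minimal sketch of that idea using the HuggingFace transformers Trainer, assuming a plain-text corpus with one tweet per line; the file name, sequence length, batch size, and epoch count are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch: domain-adaptive pre-training, continuing MLM training
# from the camembert-base checkpoint on a corpus of French tweets.
# Corpus path and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    CamembertForMaskedLM,
    CamembertTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertForMaskedLM.from_pretrained("camembert-base")

# Assumed input format: one tweet per line in a plain-text file.
dataset = load_dataset("text", data_files={"train": "french_tweets.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens: the standard BERT/CamemBERT MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bertweetfr-adapt",
        num_train_epochs=1,
        per_device_train_batch_size=32,
    ),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```

Starting from CamemBERT rather than random weights lets the model keep its general French competence while the continued pre-training shifts it toward the noisy, informal register of tweets.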

Cited by 12 publications (6 citation statements). References: 16 publications.
“…To detect concerns, we fine-tune a BERTweetFR model (Guo et al. 2021) to predict concerns from 10K human-annotated data and train for 3 epochs with a batch size of 8. Each concern becomes a binary label prediction task, allowing for multiple concerns to be found in each tweet.…”
Section: Concern Detection (mentioning)
confidence: 99%
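The statement above describes fine-tuning BERTweetFR as a set of per-concern binary classifiers, so a tweet can carry several concerns at once. A minimal sketch of that setup follows, assuming the Yanzhu/bertweetfr-base checkpoint on the HuggingFace Hub and a placeholder three-concern label set; only the 3-epoch, batch-size-8 hyperparameters come from the quoted paper.

```python
# Hypothetical sketch of multi-label concern detection: each concern is an
# independent binary label. Checkpoint name, label set, and example data
# are assumptions for illustration.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CONCERNS = ["health", "economy", "safety"]  # illustrative label set

class ConcernDataset(Dataset):
    """Pairs of (tweet text, multi-hot concern vector)."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        # Float labels: multi-label classification uses BCE-with-logits loss.
        item["labels"] = torch.tensor(self.labels[i], dtype=torch.float)
        return item

tokenizer = AutoTokenizer.from_pretrained("Yanzhu/bertweetfr-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "Yanzhu/bertweetfr-base",
    num_labels=len(CONCERNS),
    problem_type="multi_label_classification",  # one sigmoid unit per concern
)

train_set = ConcernDataset(
    ["le vaccin m'inquiète", "les prix augmentent encore"],  # toy examples
    [[1, 0, 0], [0, 1, 0]],
    tokenizer,
)

# Hyperparameters from the citing paper's description: 3 epochs, batch size 8.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="concern-model",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_set,
)
trainer.train()
```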
“…The most common among them are CamemBERT [20] and FlauBERT [21]. The first is a more general model, while the latter is better suited to downstream tasks [22]. In addition, CamemBERT outperforms FlauBERT [22], [23]. We thus will use the former as the state-of-the-art approach implementation.…”
Section: Related Work, A. Rule Identification (mentioning)
confidence: 99%
“…As confirmation of this statement, 4 out of 12 tasks in the SemEval 2023 competition were based on Twitter data. Monolingual models have been trained on tweets in languages such as English (Nguyen et al., 2020), Arabic (Antoun et al., 2020; Abdelali et al., 2021), French (Guo et al., 2021), Hebrew (Seker et al., 2022), Indonesian (Koto et al., 2021), Italian (Polignano et al., 2019) and Spanish (González et al., 2021; Pérez et al., 2022). Some of them were initialized with the weights of existing general-domain models and adapted to Twitter data by continued pre-training, while others were trained on Twitter data from scratch.…”
Section: Language Models For Social Media (mentioning)
confidence: 99%
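The last sentence of that statement distinguishes the two pre-training strategies: continued pre-training from general-domain weights (the BERTweetFR approach) versus training on tweets from scratch. The schematic contrast below uses a RoBERTa-style architecture; the checkpoint name and configuration values are illustrative assumptions.

```python
# Schematic contrast of the two strategies. Checkpoint name and
# configuration values are illustrative assumptions.
from transformers import CamembertForMaskedLM, RobertaConfig, RobertaForMaskedLM

# Continued pre-training: reuse general-domain weights, then keep training on tweets.
adapted = CamembertForMaskedLM.from_pretrained("camembert-base")

# From scratch: same architecture family, randomly initialized weights;
# this typically needs a much larger tweet corpus to reach comparable quality.
config = RobertaConfig(vocab_size=32000, hidden_size=768,
                       num_hidden_layers=12, num_attention_heads=12)
scratch = RobertaForMaskedLM(config)
```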