Background As the COVID-19 pandemic progressed, disinformation, fake news, and conspiracy theories spread through many parts of society. According to the literature, disinformation spreading through social media is one of the causes of increased COVID-19 vaccine hesitancy. In this context, the analysis of social media posts is particularly important, but the large amount of data exchanged on social media platforms requires specific methods. This is why machine learning and natural language processing models are increasingly applied to social media data.
Objective The aim of this study is to examine the capability of the CamemBERT French-language model to reliably predict the categories defined for this study, given that tweets about vaccination are often ambiguous, sarcastic, or irrelevant to the studied topic.
Methods A total of 901,908 unique French-language tweets related to vaccination published between July 12, 2021, and August 11, 2021, were extracted using Twitter’s application programming interface (version 2; Twitter Inc). Approximately 2000 randomly selected tweets were labeled with 2 types of categorizations: (1) arguments for (pros) or against (cons) vaccination (health measures included) and (2) type of content (scientific, political, social, or vaccination status). The CamemBERT model was fine-tuned and tested for the classification of French-language tweets. The model’s performance was assessed by computing the F1-score, and confusion matrices were obtained.
Results The model’s accuracy reached up to 70.6% for the first classification (pro and con tweets) and up to 90% for the second classification (scientific and political tweets). Furthermore, a tweet was 1.86 times more likely to be incorrectly classified by the model if it contained fewer than 170 characters (odds ratio 1.86; 95% CI 1.20-2.86).
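To illustrate how an odds ratio and its 95% CI of the kind reported in the Results can be obtained, the following is a minimal sketch that assumes a 2x2 table (tweet length below or above the threshold versus misclassified or correctly classified) and a Wald interval on the log odds ratio. The abstract states neither the cell counts nor the interval method, so the function name `odds_ratio_wald_ci` and the counts in the example call are hypothetical and will not reproduce the published values.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    Assumed (hypothetical) layout:
        a = short tweets, misclassified    b = short tweets, correct
        c = long tweets, misclassified     d = long tweets, correct
    z = 1.96 corresponds to a 95% interval.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) is the root of the summed reciprocal counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Made-up counts for illustration only; they are not the study's data.
or_value, ci_low, ci_high = odds_ratio_wald_ci(a=20, b=10, c=10, d=20)
print(f"OR = {or_value:.2f}, 95% CI {ci_low:.2f}-{ci_high:.2f}")
```

An interval whose lower bound stays above 1, as in the reported 1.20-2.86, indicates that the association between short tweets and misclassification is unlikely to be due to chance alone at the 5% level.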
Conclusions The accuracy of the model is affected by the classification chosen and the topic of the message examined. When the vaccine debate is jostled by contested political decisions, tweet content becomes so heterogeneous that the accuracy of the model drops for less differentiated classes. However, our tests showed that it is possible to improve the accuracy by selecting tweets using a new method based on tweet length.
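The evaluation and the length-based selection mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not describe the selection method in detail, so a simple minimum-length cutoff at 170 characters is assumed here, and the helper names (`confusion_matrix`, `f1_per_class`, `accuracy`, `select_by_length`) and the toy records are hypothetical rather than the authors' code or data.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def f1_per_class(y_true, y_pred, labels):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each label."""
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def select_by_length(records, min_chars=170):
    """Assumed selection rule: keep tweets with at least min_chars characters."""
    return [r for r in records if len(r["text"]) >= min_chars]


# Toy records (made up): placeholder text plus true and predicted labels.
records = [
    {"text": "a" * 200, "true": "pros", "pred": "pros"},
    {"text": "b" * 180, "true": "cons", "pred": "cons"},
    {"text": "c" * 50, "true": "pros", "pred": "cons"},
    {"text": "d" * 90, "true": "cons", "pred": "cons"},
]
labels = ["pros", "cons"]

for name, subset in [("all tweets", records), ("long tweets", select_by_length(records))]:
    y_true = [r["true"] for r in subset]
    y_pred = [r["pred"] for r in subset]
    print(name, accuracy(y_true, y_pred), f1_per_class(y_true, y_pred, labels))
    print(confusion_matrix(y_true, y_pred, labels))
```

Comparing the metrics on the full set with those on the length-filtered subset is one way to check whether discarding short tweets improves performance, at the cost of evaluating on fewer tweets.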