2022
DOI: 10.1609/icwsm.v16i1.19375

A Large-Scale Longitudinal Multimodal Dataset of State-Backed Information Operations on Twitter

Abstract: This paper proposes a large-scale and comprehensive dataset of 28 sub-datasets of state-backed tweets and accounts affiliated with 14 different countries, spanning more than 3 years, and a corresponding "negative" dataset of background tweets from the same time period and on similar topics. To our knowledge, this is the first dataset that contains both state-sponsored propaganda tweets and carefully collected corresponding negative tweet datasets for so many countries spanning such a long period of time.

Cited by 5 publications (3 citation statements)
References 8 publications
“…Then, we collected all tweets that included at least one of such hashtags that were published during our 4 months observation window by non-banned users. We chose to rely on hashtags, which are the most widely used method of performing content-based Twitter data collection and filtering [4], [60]. Based on the ranked lists of each IO's top hashtags, our data collection step stops upon collecting data about a few hundreds of thousands of genuine users [9], [43].…”
Section: B Genuine Users (mentioning)
confidence: 99%
“…The dataset is used for building a series of classification models from which SVM showed the best performance (F1-score is 93.32%). Moreover, Guo & Vosoughi (2022) build a large-scale dataset that contains sub-datasets with tweets and accounts affiliated with 14 countries spanning more than 3 years useful for tasks like analyzing state-sponsored propaganda.…”
Section: Background and Related Research (mentioning)
confidence: 99%
“…Existing research into suspended accounts on Twitter spans various settings, from political elections to social movements [26 28]: a limitation to these studies is that their analyses are retrospective, i.e., done by querying Twitter’s Application Programming Interface (API) with a considerable delay (sometimes years) with respect to the period of activity of observed users, hence preventing the determination of when and why accounts were actually suspended, information that are not disclosed by the platform. As labeled ground truth data about legitimate and abusive accounts is not always available, researchers typically rely on labeled datasets that emerged during audits or investigations into these platforms [29]. One such example is the case of the Russian Internet Research Agency state-controlled accounts [30], whose Twitter handles were released by the US Congress; however, Twitter never disclosed how they identified such malicious actors.…”
Section: Introduction (mentioning)
confidence: 99%