2021
DOI: 10.48550/arxiv.2112.08594
Preprint

Twitter-COMMs: Detecting Climate, COVID, and Military Multimodal Misinformation

Abstract: Detecting out-of-context media, such as "miscaptioned" images on Twitter, often requires detecting inconsistencies between the two modalities. This paper describes our approach to the Image-Text Inconsistency Detection challenge of the DARPA Semantic Forensics (SemaFor) Program. First, we collect Twitter-COMMs, a large-scale multimodal dataset with 884k tweets relevant to the topics of Climate Change, COVID-19, and Military Vehicles. We train our approach, based on the state-of-the-art CLIP model, leveraging a…
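The abstract describes a detector built on CLIP for scoring whether a tweet's image and caption are consistent. Below is a minimal sketch of that underlying signal, not the authors' trained SemaFor system: it computes the cosine similarity between CLIP image and text embeddings, where a low score suggests a possibly miscaptioned (out-of-context) pair. The model checkpoint, file names, and any decision threshold are illustrative assumptions.

```python
# Sketch: CLIP-based image-caption consistency scoring (not the paper's exact model).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def consistency_score(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # HF CLIP returns projected embeddings; normalize defensively before the dot product.
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((img_emb @ txt_emb.T).item())

# Hypothetical usage: a low score flags a potentially out-of-context caption.
score = consistency_score("tweet_image.jpg",
                          "Military vehicles cross the border near the capital.")
print(f"image-text consistency: {score:.3f}")
```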

Cited by 1 publication (10 citation statements) | References 0 publications
“…The first MMD approaches mostly relied on convolutional neural networks pre-trained on ImageNet to extract features from images (namely VGG-19 [13,26] and ResNet50 [5,22]) and word embeddings to extract features from captions (namely word2vec [13,26] and fastText [22]). More recent approaches have resorted to large-scale multimodal and cross-modal models, namely CLIP [7,12,20], VisualBERT [20] and VinVL [12] to extract both their visual and textual features. In the aforementioned works, CLIP [24] tended to outperform other cross-modal methods (VinVL and VisualBERT) for MMD [7,12,20].…”
Section: Methodological Framework, 3.1 Problem Formulation (citation type: mentioning)
confidence: 99%
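The citation statement above contrasts early MMD pipelines (ImageNet CNN features plus word embeddings) with more recent ones that feed frozen CLIP embeddings of both modalities into a classifier. The following sketch illustrates that second pattern only in outline; it is an assumption, not the cited papers' exact architecture, and the dimensions simply follow CLIP ViT-B/32's 512-d projected embeddings.

```python
# Sketch: a small MMD head over frozen CLIP image/text features (illustrative only).
import torch
import torch.nn as nn

class MMDClassifier(nn.Module):
    """Binary pristine-vs-falsified head over concatenated CLIP embeddings."""
    def __init__(self, clip_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * clip_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit: caption is inconsistent with the image
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([img_emb, txt_emb], dim=-1)).squeeze(-1)

# img_emb / txt_emb would come from a frozen CLIP encoder, as in the sketch above.
clf = MMDClassifier()
logits = clf(torch.randn(4, 512), torch.randn(4, 512))
```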
“…OOC involves pairing an image with an incongruous caption [5] while NEI involves manipulating the named entities in otherwise truthful captions [22]. Previous works have relied on random sampling [5,13] or feature-informed sampling methods [7,20] for generating OOC and in-cluster random sampling [26] or rule-based random sampling [22] for generating NEI. Examples of generated OOC and NEI misinformation can be seen in Fig.…”
Section: MMD (citation type: mentioning)
confidence: 99%
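The statement above names the two falsification strategies used to build training data: out-of-context (OOC) pairing of an image with an incongruous caption, and named-entity inconsistency (NEI) edits to otherwise truthful captions. A minimal sketch of both follows, under simplifying assumptions: plain random re-pairing for OOC (the cited works also use feature-informed sampling) and a hand-written entity map for NEI (real pipelines rely on NER models); the data and entity lists are illustrative.

```python
# Sketch: generating OOC and NEI training examples (simplified, illustrative).
import random

def make_ooc(pairs):
    """Out-of-context: attach each caption to a randomly chosen other image."""
    images = [img for img, _ in pairs]
    ooc = []
    for img, caption in pairs:
        other = random.choice([i for i in images if i != img])
        ooc.append((other, caption))  # mismatched but plausible pair
    return ooc

def make_nei(caption, entity_map):
    """Named-entity inconsistency: swap a known entity for a same-type one."""
    for original, replacement in entity_map.items():
        if original in caption:
            return caption.replace(original, replacement)
    return caption  # unchanged if no listed entity appears

pairs = [("img_001.jpg", "Flooding in Jakarta after record rainfall."),
         ("img_002.jpg", "A T-72 tank on parade in Moscow.")]
print(make_ooc(pairs))
print(make_nei(pairs[1][1], {"Moscow": "Berlin"}))
```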