With the expansion of social media and the growing dissemination of multimedia content, the spread of misinformation has become a major concern. This necessitates effective strategies for multimodal misinformation detection (MMD), i.e., detecting whether the combination of an image and its accompanying text could mislead or misinform. Due to the data-intensive nature of deep neural networks and the labor-intensive process of manual annotation, researchers have been exploring various methods for automatically generating synthetic multimodal misinformation, which we refer to as Synthetic Misinformers, in order to train MMD models. However, limited evaluation on real-world misinformation and the lack of comparisons among Synthetic Misinformers make it difficult to assess progress in the field. To address this, we perform a comparative study of existing and new Synthetic Misinformers, covering (1) out-of-context (OOC) image-caption pairs, (2) cross-modal named entity inconsistency (NEI), and (3) hybrid approaches, and we evaluate them against real-world misinformation using the COSMOS benchmark. The comparative study showed that our proposed CLIP-based Named Entity Swapping can lead to MMD models that surpass other OOC and NEI Misinformers in terms of multimodal accuracy, and that hybrid approaches can yield even higher detection accuracy. Nevertheless, after alleviating information leakage in the COSMOS evaluation protocol, low Sensitivity scores indicate that the task is significantly more challenging than previous studies suggested. Finally, our findings showed that NEI-based Synthetic Misinformers tend to suffer from a unimodal bias, where text-only models can outperform multimodal ones.
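As a rough illustration of what CLIP-based named entity swapping could look like in practice, the sketch below extracts named entities from a caption with spaCy, generates candidate captions by substituting same-type entities from a replacement pool, and keeps the swap that CLIP scores as most compatible with the image, i.e., a plausible but false caption. This is a minimal sketch under assumed design choices: the function name, the structure of the entity pool, and the use of maximum image-text similarity as the selection criterion are all hypothetical and may differ from the paper's actual procedure.

```python
import spacy
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

nlp = spacy.load("en_core_web_sm")
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_entity_swap(image_path, caption, entity_pool):
    """Return a falsified caption via named entity swapping.

    entity_pool is a hypothetical dict mapping spaCy entity labels
    (e.g. 'PERSON', 'GPE') to lists of replacement strings.
    """
    doc = nlp(caption)
    candidates = []
    for ent in doc.ents:
        for repl in entity_pool.get(ent.label_, []):
            if repl != ent.text:
                candidates.append(caption.replace(ent.text, repl))
    if not candidates:
        return None  # no swappable named entity found

    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(candidates, truncate=True).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feats = model.encode_text(text)
        image_feat /= image_feat.norm(dim=-1, keepdim=True)
        text_feats /= text_feats.norm(dim=-1, keepdim=True)
        sims = (image_feat @ text_feats.T).squeeze(0)
    # The most image-compatible falsified caption makes the hardest
    # negative example for training an MMD model.
    return candidates[sims.argmax().item()]
```

Selecting the swap with the highest CLIP similarity (rather than a random one) is what would make such negatives hard: the falsified caption still loosely matches the image, so a detector cannot rely on superficial image-text mismatch alone.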