SynthCap: Augmenting Transformers with Synthetic Data for Image Captioning

Caffagni, Davide; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

doi:10.1007/978-3-031-43148-7_10


Cited by 2 publications

References 46 publications


Common Canvas: Open Diffusion Models Trained on Creative-Commons Images
Gokaslan, Cooper, Collins et al. 2024
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Parents and Children: Distinguishing Multimodal Deepfakes from Natural Images
Amoroso, Morelli, Cornia et al. 2024
ACM Trans. Multimedia Comput. Commun. Appl.

Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language. While these models have numerous benefits across various sectors, they have also raised concerns about the potential misuse of fake images and cast new pressures on fake image detection. In this work, we pioneer a systematic study on the detection of deepfakes generated by state-of-the-art diffusion models. Firstly, we conduct a comprehensive analysis of the performance of contrastive and classification-based visual features, respectively extracted from CLIP-based models and ResNet or ViT-based architectures trained on image classification datasets. Our results demonstrate that fake images share common low-level cues, which render them easily recognizable. Further, we devise a multimodal setting wherein fake images are synthesized by different textual captions, which are used as seeds for a generator. Under this setting, we quantify the performance of fake detection strategies and introduce a contrastive-based disentangling method that lets us analyze the role of the semantics of textual descriptions and low-level perceptual cues. Finally, we release a new dataset, called COCOFake, containing about 1.2M images generated from the original COCO image-caption pairs using two recent text-to-image diffusion models, namely Stable Diffusion v1.4 and v2.0.
