Findings of the Association for Computational Linguistics: NAACL 2022
DOI: 10.18653/v1/2022.findings-naacl.39

Fine-grained Image Captioning with CLIP Reward

Abstract: Modern image captioning models are usually trained with text similarity objectives. However, since reference captions in public datasets often describe the most salient common objects, models trained with the text similarity objectives tend to ignore specific and detailed aspects of an image that distinguish it from others. Towards more descriptive and distinctive caption generation, we propose to use CLIP, a multimodal encoder trained on huge image-text pairs from the web, to calculate multi-modal similarity …
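The abstract describes computing an image-caption similarity with CLIP and using it as a reward when fine-tuning the captioner. A minimal policy-gradient sketch of that idea follows; `captioner.sample`, `captioner.greedy`, and `clip_similarity` are hypothetical interfaces standing in for the actual model and metric, and the self-critical greedy baseline is an assumption rather than necessarily the paper's exact setup.

```python
import torch

def clip_reward_loss(images, captioner, clip_similarity):
    """Policy-gradient loss with a CLIP similarity reward (illustrative sketch).

    `captioner` and `clip_similarity` are hypothetical interfaces; the paper's
    actual training recipe may differ (e.g., reward shaping, baseline choice).
    """
    # Sample captions and keep their sequence log-probabilities, shape (B,).
    sampled_captions, log_probs = captioner.sample(images)
    with torch.no_grad():
        # Greedy decoding as a self-critical baseline (a common choice in captioning RL).
        greedy_captions = captioner.greedy(images)
        reward = clip_similarity(images, sampled_captions)   # shape (B,)
        baseline = clip_similarity(images, greedy_captions)  # shape (B,)
    advantage = reward - baseline
    # REINFORCE: raise the log-probability of captions that earn a higher CLIP reward.
    return -(advantage * log_probs).mean()
```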

Cited by 28 publications (8 citation statements)
References 11 publications
“…Unsupervised metrics directly capture the similarity between input images and generated sentences. A typical metric, CLIP-S [53], quantifies the cosine similarity between image and sentence features extracted from a pretrained CLIP model.…”
Section: Baseline Methods and Evaluation Metrics
Citation type: mentioning; confidence: 99%
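As the statement above notes, CLIP-S boils down to the cosine similarity between CLIP image and text embeddings (the CLIPScore paper additionally rescales the similarity by 2.5 and clips negatives at zero). Below is a minimal sketch using the Hugging Face transformers CLIP API; the checkpoint name is only an example.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Example checkpoint; any checkpoint compatible with CLIPModel works.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    img = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                              attention_mask=inputs["attention_mask"]), dim=-1)
    # CLIP-S as reported by Hessel et al. would be 2.5 * max(similarity, 0).
    return (img * txt).sum(-1).item()
```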
“…Recent research [53] demonstrates that matching scores between the image and text based on CLIP can increase the diversity and accuracy of the generated captions. This enhancement is achieved by directly measuring the correlation between the input image and the generated sentences.…”
Section: Semantic Consistency
Citation type: mentioning; confidence: 99%
“…Among the approaches more closely related to ours, the system of Yu et al. [51] uses a ClipCap-like system for caption generation, and CLIP to measure image-caption similarity, focusing on generating captions in multiple styles. Cho et al. [8] use CLIP to finetune a pre-trained captioner. Like [51], they use the CLIP-Score image-caption similarity measure [16] as a reward signal, rather than a discriminative objective like the one we adopt.…”
Section: Related Work
Citation type: mentioning; confidence: 99%
“…Deep learning-based vision-language models (VLMs) [45] unify text and visual data into a common representation and reduce the computing cost of training for specific computer vision [19] and visually-grounded linguistic tasks [12,47]. VLMs are trained with a large amount of data with the aim of matching image and text representations for image-caption pairs to capture diverse visual and linguistic concepts.…”
Section: Introduction
Citation type: mentioning; confidence: 99%