Fine-grained Image Captioning with CLIP Reward

Cho, Jaemin; Yoon, Seunghyun; Kale, Ajinkya; Dernoncourt, Franck; Bui, Trung; Bansal, Mohit

doi:10.18653/v1/2022.findings-naacl.39

Cited by 28 publications

(8 citation statements)

References 11 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Unsupervised metrics directly capture the similarity between input images and generated sentences. A typical metric, CLIP-S [53], quantifies the cosine similarity between features of images and sentence extracted from pretrained CLIP model.…”

Section: Baseline Methods and Evaluation Metricsmentioning

confidence: 99%

“…Recent research [53] demonstrates that matching scores the image and text based on CLIP can increase the diversity and accuracy of captions generated. This enhancement is achieved through direct measurement of the correlation between the input image and the generated sentences.…”

Section: Semantic Consistencymentioning

confidence: 99%

“…While calculating the loss function L, it is important to note that some operations involved in the process are non-differentiable, such as sampling the probability distribution to retrieve words. Thus, to ensure the update of the gradient, following [54,53], we optimize the model with REINFORCE algorithm [55] with a self-critical baseline.…”

Section: Total Lossmentioning

confidence: 99%

See 2 more Smart Citations

Blind Hyperspectral Image Denoising with Degradation Information Learning

2023

View full text Add to dashboard Cite

Although existing hyperspectral image (HSI) denoising methods have exhibited promising performance in synthetic noise removal, they are seriously restricted in real-world scenarios with complicated noises. The major reason is that model-based methods largely rely on the noise type assumption and parameter setting, and learning-based methods perform poorly in generalizability due to the scarcity of real-world clean–noisy data pairs. To overcome this long-standing challenge, we propose a novel denoising method with degradation information learning (termed DIBD), which attempts to approximate the joint distribution of the clean–noisy HSI pairs in a Bayesian framework. Specifically, our framework learns the mappings of noisy-to-clean and clean-to-noisy in a priority dual regression scheme. We develop more comprehensive auxiliary information to simplify the joint distribution approximation process instead of only estimating noise intensity. Our method can leverage both labeled synthetic and unlabeled real data for learning. Extensive experiments show that the proposed DIBD achieves state-of-the-art performance on synthetic datasets and has better generalization to real-world HSIs. The source code will be available to the public.

show abstract

Section: Baseline Methods and Evaluation Metricsmentioning

confidence: 99%

Section: Semantic Consistencymentioning

confidence: 99%

See 1 more Smart Citation

Blind Hyperspectral Image Denoising with Degradation Information Learning

2023

View full text Add to dashboard Cite

show abstract

“…Among the approaches more closely related to ours, the system of Yu et al [51] uses a ClipCap-like system for caption generation, and CLIP to measure image-caption similarity, focusing on generating captions in multiple styles. Cho et al [8] use CLIP to finetune a pre-trained captioner. Like [51], they use the CLIP-Score image-caption similarity measure [16] as a reward signal, rather than a discriminative objective like the one we adopt.…”

Section: Related Workmentioning

confidence: 99%

Cross-Domain Image Captioning with Discriminative Finetuning

Dessì,

Bevilacqua,

Gualdoni

et al. 2023

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

View full text Add to dashboard Cite

Neural captioners are typically trained to mimic humangenerated references without optimizing for any specific communication goal, leading to problems such as the generation of vague captions. In this paper, we show that fine-tuning an out-of-the-box neural captioner with a selfsupervised discriminative communication objective helps to recover a plain, visually descriptive language that is more informative about image contents. Given a target image, the system must learn to produce a description that enables an out-of-the-box text-conditioned image retriever to identify such image among a set of candidates. We experiment with the popular ClipCap captioner, also replicating the main results with BLIP. In terms of similarity to groundtruth human descriptions, the captions emerging from discriminative finetuning lag slightly behind those generated by the non-finetuned model, when the latter is trained and tested on the same caption dataset. However, when the model is used without further tuning to generate captions for out-of-domain datasets, our discriminatively-finetuned captioner generates descriptions that resemble human references more than those produced by the same captioner without finetuning. We further show that, on the Conceptual Captions dataset, discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators tasked with an image discrimination task. 1 10 As the focus of this analysis in on the Conceptual Captionstrained/finetuned models, we will drop the -ConCap suffix.

show abstract

“…Deep learning-based vision-language models (VLMs) [45] unify text and visual data into a common representation and reduce the computing cost of training for specific computer vision [19] and visually-grounded linguistic tasks [12,47]. VLMs are trained with a large amount of data with the aim of matching image and text representations for image-caption pairs to capture diverse visual and linguistic concepts.…”

Section: Introductionmentioning

confidence: 99%

DeAR: Debiasing Vision-Language Models with Additive Residuals

Seth¹,

Hemani²,

Agarwal³

2023

Preprint

View full text Add to dashboard Cite

Before Debiasing After Debiasing "Doctor" "Nurse" CLIP Image Encoder CLIP Text Encoder "photo of a doctor" cos cos 0.237 0.241 ≈ 0.239 w/o Debiasing (B) Zero-shot Object Detection with CLIP Bias in VLMs < 0.244 With DeAR Figure 1. We present DEAR -a framework to de-bias large Vision-Language models (VLM) like CLIP [45], exhibited in the skewed similarity between specific language concepts and images of people of certain visual characteristics. (A) Attribution maps from the DEAR-augmented CLIP model indicate how the attribution for a text concept shifts from the person's facial characteristics to objective cues in the image. (B) Results for zero-shot object detection with CLIP-ODS [48] that uses CLIP before and after debiasing show a clear improvement in the fairness of its detection results.

show abstract

Fine-grained Image Captioning with CLIP Reward

Cited by 28 publications

References 11 publications

Blind Hyperspectral Image Denoising with Degradation Information Learning

Blind Hyperspectral Image Denoising with Degradation Information Learning

Cross-Domain Image Captioning with Discriminative Finetuning

DeAR: Debiasing Vision-Language Models with Additive Residuals

Contact Info

Product

Resources

About