ClipCap: CLIP Prefix for Image Captioning

Mokady, Ron; Hertz, Amir; Bermano, Amit H.

doi:10.48550/arxiv.2111.09734

Cited by 109 publications

(185 citation statements)

References 35 publications

Supporting

Mentioning: 184

Contrasting

“…However, this ability is exactly the generative task DALL-E was trained to do, only in new domains. No previous computer vision work, as far as we can ascertain, has … [table residue; columns: Method, B@1, R@5, C-s; row: CLIP-Prefix [49]] … Comparison of our method and CLIP-Prefix baseline on our novel benchmark for visual relations.…”

Section: Discussion and Limitations (mentioning)

confidence: 99%

“…1, we present our results for COCO's test set [42]. Two recent baselines that use CLIP's embedding are compared to: CLIP-Prefix [49] and CLIP-VL [61]. In CLIP-Prefix, the image is encoded using CLIP and the representation is transferred and plugged as a token into a fine-tuned GPT-2.…”

Section: Image Captioning (mentioning)

confidence: 99%

“…We compared our results with CLIP-Prefix [49] that encodes the image with CLIP's image encoder and uses it as an initial token for GPT-2. The method is fine-tuned based on COCO dataset.…”

Section: Visual-semantic Arithmetic Study (mentioning)

confidence: 99%

“…Generated captions by our method and by the baseline methods for images from the MS-COCO [42] test-set. CP=CLIP-Prefix [49], CVL=CLIP-VL [61], VVL=VinVL [74].…”

Section: Visual Relations Benchmark Study (mentioning)

confidence: 99%

“…14 (shown at the end of the document due to size), we present our results on 200 randomly-selected images along with baselines. For baselines, we use CLIP-Prefix [49], CLIP-VL [61], and VinVL [74]. Our method generates original captions that…”

Section: Appendices (mentioning)

confidence: 99%

ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Tewel, Shalev, Wolf

2021

Preprint

Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. In this work, we repurpose such models to generate a descriptive text given an image at inference time, without any further training or tuning step. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible and we demonstrate its ability to perform image arithmetic in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests. Our code is available at: https://github.com/YoadTew/zero-shot-image-to-text.

A Survey on Automatic Image Captioning Approaches: Contemporary Trends and Future Perspectives

Salgotra, Abrol, Selwal

2024

Arch Computat Methods Eng

XAINES: Explaining AI with Narratives

Hartmann, Feldhus, et al.

2022

Künstl Intell

Artificial Intelligence (AI) systems are increasingly pervasive: Internet of Things, in-car intelligent devices, robots, and virtual assistants. Their large-scale adoption makes it necessary to explain their behaviour, for example to the users who are impacted by their decisions, or to the developers who need to ensure their functionality. This requires, on the one hand, obtaining an accurate representation of the chain of events that caused the system to behave in a certain way (e.g., to make a specific decision). On the other hand, this causal chain needs to be communicated to the users depending on their needs and expectations. In this phase of explanation delivery, allowing interaction between user and model has the potential to improve both model quality and user experience. The XAINES project investigates the explanation of AI systems through narratives targeted to the needs of a specific audience, focusing on two aspects that are crucial for successful explanation: generating and selecting appropriate explanation content, i.e. the information to be contained in the explanation, and delivering this information to the user in an appropriate way. In this article, we present the project’s roadmap towards enabling the explanation of AI with narratives.
