Relational position embedding is now widely used in many large multi-modal models. It originated in relational captioning (a branch of image captioning) and consists of two procedures: geometric modelling and prior attention. However, several problems remain unsolved in these conventional procedures. This paper reviews the shortcomings of geometric modelling and prior attention. A new framework, the relational guided transformer (RGT), is then proposed to verify the authors' conclusions at the origin of relational position embedding, namely relational captioning. Specifically, RGT introduces two simple but effective improvements to geometric modelling and prior attention: (1) a learned geometric modelling strategy, multi-task geometric modelling (MTG), is trained under a multi-task objective and replaces the original hand-crafted geometric features; (2) several kinds of prior attention are analysed, and their strengths are consolidated into spatial guided attention (SGA), which integrates geometric prior knowledge into the attention mechanism. Extensive experiments on MSCOCO and Flickr30k are performed to investigate the effectiveness of each module and to support the authors' argument. The model is also shown to outperform the authors' baseline in offline evaluation on the "Karpathy" test split of both datasets.
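To make the two procedures concrete, the sketch below illustrates one plausible way a learned geometric prior (in the spirit of MTG) could bias attention logits (in the spirit of SGA). The abstract does not specify the authors' implementation, so everything here is an assumption: the class name SpatialGuidedAttention, the box parameterisation, and the MLP over pairwise box offsets are all hypothetical, stand-ins for the paper's actual MTG/SGA design.

```python
# Hypothetical sketch only: learned pairwise geometry biasing attention
# logits. Not the authors' implementation; shapes and names are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGuidedAttention(nn.Module):
    """Single-head attention whose logits are shifted by a learned
    geometric prior computed from region box coordinates."""

    def __init__(self, d_model: int, d_geo: int = 64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Learned geometric modelling: an MLP over pairwise box features
        # stands in for hand-crafted relative-geometry encodings.
        self.geo_mlp = nn.Sequential(
            nn.Linear(4, d_geo), nn.ReLU(), nn.Linear(d_geo, 1)
        )

    def forward(self, x, boxes):
        # x: (B, N, d_model) region features
        # boxes: (B, N, 4) as normalised (cx, cy, w, h), w, h > 0
        q, k, v = self.q(x), self.k(x), self.v(x)
        logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # (B, N, N)

        # Pairwise geometry: centre offsets and log size ratios.
        cx, cy, w, h = boxes.unbind(-1)
        dx = (cx.unsqueeze(-1) - cx.unsqueeze(-2)) / w.unsqueeze(-1)
        dy = (cy.unsqueeze(-1) - cy.unsqueeze(-2)) / h.unsqueeze(-1)
        dw = torch.log(w.unsqueeze(-1) / w.unsqueeze(-2))
        dh = torch.log(h.unsqueeze(-1) / h.unsqueeze(-2))
        geo = torch.stack([dx, dy, dw, dh], dim=-1)  # (B, N, N, 4)

        # Geometric prior added to attention logits before the softmax.
        logits = logits + self.geo_mlp(geo).squeeze(-1)
        return F.softmax(logits, dim=-1) @ v

# Usage on dummy region features and boxes (36 regions, 512-d features):
attn = SpatialGuidedAttention(d_model=512)
out = attn(torch.randn(2, 36, 512), torch.rand(2, 36, 4).clamp(min=0.05))
```

Adding the prior to the logits rather than the post-softmax weights keeps the output a proper attention distribution; under a multi-task setup, the geometry branch could additionally be supervised by an auxiliary loss, though that objective is not detailed in the abstract.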