2022
DOI: 10.48550/arxiv.2205.01883
Preprint

All You May Need for VQA are Image Captions

Abstract: Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits…
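
As a rough illustration of the pipeline the abstract describes (captions in, questions out), the sketch below feeds a caption into a text-to-text question-generation model through the Hugging Face pipeline API. The checkpoint name "example-org/t5-question-generation", the caption, and the extracted answer are hypothetical placeholders, not the authors' released artifacts.

```python
# Minimal sketch of deriving a VQA example from an image caption, assuming
# some T5-style question-generation checkpoint is available.
# "example-org/t5-question-generation" is a hypothetical placeholder name.
from transformers import pipeline

caption = "A brown dog catches a frisbee in a grassy park."  # existing caption annotation

qg = pipeline("text2text-generation", model="example-org/t5-question-generation")
question = qg(f"generate question: {caption}", max_new_tokens=32)[0]["generated_text"]

# A candidate answer can be read off the caption itself (e.g. an extracted
# noun phrase), giving an automatically derived (image, question, answer) triple.
answer = "a frisbee"
print(question, "->", answer)
```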

Cited by 2 publications (5 citation statements)
References: 42 publications

Citation statements:
“…[6]. Regarding different baselines, such as Img2Prompt [7], PICa [8], we follow their official implementation to convert images into captions via either VinVL-base pre-trained checkpoint [9] or BLIP [10] and generate exemplar prompts via either CLIP [11] or finetuned T5-large model [12]. Notably, we implement a light version of Img2Prompt on VQAv2 dataset due to our computation limitation, the details of which can be found in Appendix A.2.…”
Footnotes from the excerpt:
[4] https://huggingface.co/docs/transformers/model_doc/opt
[5] https://openai.com
[6] https://huggingface.co/docs/transformers/model_doc/bloom
[7] https://github.com/salesforce/LAVIS/tree/main/projects/img2llm-vqa
[8] https://github.com/microsoft/PICa
[9] https://github.com/pzzhang/VinVL
[10] https://github.com/salesforce/BLIP
[11] https://github.com/OpenAI/CLIP
[12] https://github.com/google-research/text-to-text-transfer-transformer
Section: Methods
Citation type: mentioning
Confidence: 99%
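
For context, here is a minimal sketch of the image-to-caption step this excerpt refers to, using the released BLIP captioning checkpoint in Hugging Face transformers (the VinVL alternative relies on the separate pzzhang/VinVL tooling). The image path is a placeholder.

```python
# Sketch of the "convert images into captions" step, using the BLIP
# captioning checkpoint from Hugging Face transformers.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)
print(caption)  # the caption text is then placed into the LLM prompt
```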
“…A general solution is to augment image-question pairs for training. Multi-modal pre-training models like CLIP [38] are frequently leveraged to generate synthetic question-answer pairs from images [4,7].…”
Section: Zero/Few-Shot of VQA Tasks
Citation type: mentioning
Confidence: 99%
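
One common way CLIP is used in such synthetic-data pipelines is to score candidate texts against the image; the sketch below shows that pattern as an illustrative assumption, not the exact procedure of the works cited in the excerpt. The image path and candidate answers are placeholders.

```python
# Hedged sketch: scoring candidate answers against an image with CLIP,
# one way synthetic QA pairs can be filtered for image consistency.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")  # placeholder path
candidate_answers = ["a dog", "a cat", "a frisbee"]  # hypothetical candidates

inputs = processor(text=candidate_answers, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # image-text similarity
probs = logits_per_image.softmax(dim=-1)
best = candidate_answers[probs.argmax().item()]
print(best)  # keep the answer that best matches the image
```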
“…Seeing is Knowing (106), MULAN (107)
Faster R-CNN with ResNet-101: GAT (108), ATH (109), DMMGR (24), MCLN (110), MCAN (111), F-SWAP (112), SRRN (35), TVQA (113)
Faster R-CNN with ResNet-152: RA-MAP (114), MASN (115), Anomaly based (114), Vocab based (116), DA-Net (117)
ResNet CNN within Faster R-CNN: MuVAM (118)
Faster R-CNN with ResNeXt-152: CBM (119)
RCNN (120): Multi-image (89)
VGGNet (121): VQA-AID (122)
EfficientNetV2 (123): RealFormer (124)
YOLO (125): Scene Text VQA (126)
CLIP ViT-B: CCVQA (14)
ResNet NFNet (127): Flamingo (128)
ViT (129): VLMmed (46), ConvS2S+ViT (130), BMT (10), M2I2 (52)
XCLIP with ViT-L/14: CMQR (32)
ResNet18, Swin, ViT: LV-GPT (43)
GLIP (131): REVIVE (132)
CLIP (133): KVQAE (30)
2.6.4 VGGNet (121): VGGNet (Visual Geometry Group Network) is a CNN with a small number of layers, achieving good performance in image classification tasks. It is known mainly for its simplicity and generalizability to new datasets.…”
Section: Faster RCNN
Citation type: mentioning
Confidence: 99%
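
To make the backbone-to-model mapping above concrete, here is a hedged sketch of the generic pattern behind entries like VGGNet: load a pretrained backbone from torchvision, drop its classifier head, and use the pooled convolutional features as the visual input to a VQA model. The image path is a placeholder, and this is a generic illustration rather than any specific cited system.

```python
# Generic sketch: VGG-16 as a visual feature extractor for a VQA model.
import torch
from PIL import Image
from torchvision import models, transforms

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # placeholder path
with torch.no_grad():
    feats = vgg.features(image)             # (1, 512, 7, 7) conv feature map
    pooled = vgg.avgpool(feats).flatten(1)  # (1, 25088) visual feature vector
print(pooled.shape)  # this vector (or the feature map) feeds the VQA fusion module
```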