Proceedings of the 31st ACM International Conference on Multimedia 2023
DOI: 10.1145/3581783.3612389

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

Yunshi Lan, Xiang Li, Xin Liu, et al.

Cited by 8 publications (2 citation statements)
References 57 publications
“…Recent developments have witnessed significant progress in the alignment of images with accompanying text, such as Contrastive Language-Image Pretraining (CLIP) [6]. In addition to the multimodal uses of CLIP [7][8][9][10][11], the visual features provided by CLIP have showcased remarkable versatility in diverse applications, such as captioning [12][13][14][15], object detection [16], semantic image segmentation [17], cross-modal retrieval tasks [18][19][20], etc. This wide-ranging utilization underscores the broad applicability and robust performance of CLIP and its derivatives across a spectrum of interdisciplinary challenges.…”
Section: Introduction (mentioning)
confidence: 99%
“…poor translation quality. DDPM resolves this question by using a large-scale pre-training with text-to-image data [2]-[4] and integrating multimodal information like large-scale language models [13], [14].…”
Section: Introduction (mentioning)
confidence: 99%