Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.517

Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering

Abstract: Knowledge-based visual question answering (VQA) requires answering questions with external knowledge in addition to the content of images. One dataset that is mostly used in evaluating knowledge-based VQA is OK-VQA, but it lacks a gold standard knowledge corpus for retrieval. Existing work leverages different knowledge bases (e.g., ConceptNet and Wikipedia) to obtain external knowledge. Because of the varying knowledge bases, it is hard to fairly compare models' performance. To address this issue, we collect a natu…

Cited by 31 publications (23 citation statements)
References 36 publications
“…ViLBERT (Lu et al, 2019) and LXMERT (Tan and Bansal, 2019) propose a two-stream architecture that processes images and text independently and fuses them with a third transformer at a later stage. While these models have been shown to store in-depth cross-modal knowledge and have achieved impressive performance on knowledge-based VQA (Marino et al, 2021; Wu et al, 2022; Luo et al, 2021), this type of implicitly learned knowledge is not sufficient to answer many knowledge-based questions (Marino et al, 2021). Another line of work on multimodal transformers, such as CLIP (Radford et al, 2021) or ALIGN (Jia et al, 2021), aligns visual and language representations by contrastive learning.…”
Section: Related Work
confidence: 99%
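The contrastive alignment mentioned in the statement above can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example of a CLIP-style symmetric contrastive loss over a batch of paired image and text embeddings; the function name and the assumption that encoders have already produced `image_feats` and `text_feats` are illustrative, not the actual CLIP implementation.

```python
# Minimal sketch of CLIP-style contrastive image-text alignment (hypothetical,
# not the actual CLIP code). Assumes `image_feats` and `text_feats` are batches
# of encoder outputs with matching row order (i-th image pairs with i-th text).
import torch
import torch.nn.functional as F

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    # L2-normalize so dot products become cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j) / T
    logits = image_feats @ text_feats.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```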
“…Recent approaches have shown great potential for incorporating external knowledge into knowledge-based VQA. Several methods explore aggregating external knowledge in the form of structured knowledge graphs (Garderes et al, 2020; Narasimhan et al, 2018; Li et al, 2020b; Wang et al, 2017a,b), unstructured knowledge bases (Marino et al, 2021; Wu et al, 2022; Luo et al, 2021), or neural-symbolic inference-based knowledge (Chen et al, 2020; West et al, 2021). In these methods, object detectors (Ren et al, 2015) and scene classifiers (He et al, 2016) are used to associate images with external knowledge.…”
Section: Related Work
confidence: 99%
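As a rough illustration of how detector outputs can be associated with external knowledge, the sketch below indexes a toy fact corpus by keywords and looks up candidate facts for the object labels detected in an image. The fact strings, labels, and helper names are hypothetical placeholders, not the pipeline of any of the cited methods.

```python
# Hypothetical sketch: detected object labels serve as keys into a
# keyword-indexed store of natural-language facts.
from collections import defaultdict

def build_index(facts):
    """Index each fact sentence by the lowercased words it contains."""
    index = defaultdict(list)
    for fact in facts:
        for word in set(fact.lower().split()):
            index[word].append(fact)
    return index

def facts_for_image(detected_labels, index):
    """Collect candidate knowledge for every object label found in an image."""
    candidates = []
    for label in detected_labels:
        candidates.extend(index.get(label.lower(), []))
    return candidates

facts = ["A fire hydrant supplies water to firefighters.",
         "An umbrella is used as protection against rain."]
index = build_index(facts)
print(facts_for_image(["umbrella", "person"], index))
```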
“…Result: We evaluate three retrievers on the OK-VQA dataset and use the knowledge base (with 112,724 pieces of knowledge) created in (Luo et al, 2021b) as the corpus. We retrieve 1/5/10/20/50/80/100 knowledge passages for each question.…”
Section: How To Retrieve Information For
confidence: 99%
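The top-k retrieval step described above can be sketched under the assumption of a dense retriever: question and passage embeddings are compared by cosine similarity and the k best passages are kept, sweeping the same retrieval depths (1/5/10/20/50/80/100). The encoder is stubbed out with random vectors here; names and sizes are illustrative, not the authors' actual retriever or corpus.

```python
# Hypothetical sketch of dense top-k knowledge retrieval for one question.
# `question_emb` (d,) and `corpus_embs` (N, d) are assumed to come from some
# pretrained encoder; the random corpus stands in for the ~112k-passage KB.
import numpy as np

def retrieve_top_k(question_emb, corpus_embs, k):
    # Cosine similarity between the question and every knowledge passage.
    q = question_emb / np.linalg.norm(question_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q
    # Indices of the k highest-scoring passages, best first.
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]

# Sweep the retrieval depths used in the quoted evaluation.
rng = np.random.default_rng(0)
corpus_embs = rng.normal(size=(1000, 128))
question_emb = rng.normal(size=128)
for k in (1, 5, 10, 20, 50, 80, 100):
    ids = retrieve_top_k(question_emb, corpus_embs, k)
    print(k, ids[:3])
```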