2021
DOI: 10.48550/arxiv.2101.06013
Preprint

Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge

Abstract: The limits of applicability of vision-and-language models are defined by the coverage of their training data. Tasks like visual question answering (VQA) often require commonsense and factual information beyond what can be learned from task-specific datasets. This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers. We use an auxiliary training objective that encourages the learned representations to align with graph embeddings of matchin…
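
The auxiliary objective mentioned in the abstract, aligning the transformer's learned representations with graph embeddings of matching KB entities, can be pictured as an alignment loss added to the main task loss. The sketch below is a hypothetical PyTorch illustration under that reading, not the paper's actual implementation; the function name, the linear projection, the cosine-distance form, and all dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def knowledge_alignment_loss(pooled_repr: torch.Tensor,
                             entity_embedding: torch.Tensor,
                             projection: torch.nn.Module) -> torch.Tensor:
    """Cosine-distance alignment (assumed form) between the model's pooled
    representation, projected into the KB embedding space, and the graph
    embedding of the matching entity, averaged over the batch."""
    projected = F.normalize(projection(pooled_repr), dim=-1)  # (batch, kb_dim)
    target = F.normalize(entity_embedding, dim=-1)            # (batch, kb_dim)
    return (1.0 - (projected * target).sum(dim=-1)).mean()

# Hypothetical usage alongside the main VQA objective (weights and sizes assumed):
# projection = torch.nn.Linear(768, 200)  # transformer dim -> graph-embedding dim
# loss = vqa_loss + 0.1 * knowledge_alignment_loss(pooled, entity_emb, projection)
```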

Cited by 4 publications (10 citation statements)
References 39 publications
“…More recently, the dataset outside-knowledge visual question answering (OK-VQA) [39] is proposed where the usage of outside knowledge is open to the entire web. Most existing work for OK-VQA rely on the pre-trained vision-language models as a major workhorse for question answering [12,36,38,48,60,63]. In [12,48], learned knowledge embeddings are injected into vision-language models to perform knowledge-aware question answering.…”
Section: Related Work
confidence: 99%
“…Most existing work for OK-VQA rely on the pre-trained vision-language models as a major workhorse for question answering [12,36,38,48,60,63]. In [12,48], learned knowledge embeddings are injected into vision-language models to perform knowledge-aware question answering. Other work uses vision-language models as a knowledge-free VQA model first and later adjusts the predicted answers by fusion with knowledge graphs [38] or answer validation with knowledge text [60].…”
Section: Related Work
confidence: 99%
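
The statement above contrasts knowledge injection with a "predict first, adjust later" pipeline, where a knowledge-free VQA model scores candidate answers and the scores are then revised using knowledge-graph evidence [38] or knowledge text [60]. The following is a minimal, hypothetical sketch of such a re-ranking step; the function, its inputs, and the support heuristic are assumptions for illustration, not the cited methods.

```python
from typing import Dict, Set

def rerank_with_kg(vqa_scores: Dict[str, float],
                   question_entities: Set[str],
                   kg_neighbors: Dict[str, Set[str]],
                   weight: float = 0.5) -> Dict[str, float]:
    """Boost candidate answers that are linked in the knowledge graph to
    entities detected in the question or image (assumed adjustment rule)."""
    adjusted = {}
    for answer, score in vqa_scores.items():
        support = sum(answer in kg_neighbors.get(entity, set())
                      for entity in question_entities)
        adjusted[answer] = score + weight * support
    return adjusted

# Example: an image of a racket, question "What sport is this used for?"
# rerank_with_kg({"tennis": 0.40, "baseball": 0.35},
#                {"racket"}, {"racket": {"tennis"}})
# -> {"tennis": 0.90, "baseball": 0.35}
```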