2018
DOI: 10.1007/978-3-030-01237-3_28

Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering

Abstract: Question answering is an important task for autonomous agents and virtual assistants alike and was shown to support the disabled in efficiently navigating an overwhelming environment. Many existing methods focus on observation-based questions, ignoring our ability to seamlessly combine observed content with general knowledge. To understand interactions with a knowledge base, a dataset has been introduced recently and keyword matching techniques were shown to yield compelling results despite being vulnerable to…

Cited by 97 publications (78 citation statements) | References 57 publications

Citation statements, ordered by relevance:
“…In recent years various machine learning techniques were developed to tackle cognitive-like multimodal tasks, which involve both vision and language processing. Image captioning [36,56,24,50,7,4,13] was an instrumental language+vision task, followed by visual question answering [33,42,25,34,41,5,15,59,23,3,9,14,46,55,42,54,39,40,43] and visual question generation [41,38,22,49,28,6].…”
Section: Related Work
Mentioning confidence: 99%
“…For instance, in computer vision, a tremendous amount of recent work has focused on image captioning [68,30,11,16,75,45,77,31,69,4,15,10], visual question generation [36,48,47,28], visual question answering [5,19,59,54,44,73,74,76,57,58,49,50], and very recently visual dialog [13,14,27,46]. While those meticulously engineered algorithms have shown promising results in their specific domain, little is known about the end-to-end performance of an entire system.…”
Section: Introduction
Mentioning confidence: 99%
“…Our method with finetuned QANet achieves the highest top-1 accuracy, which is 0.7% higher than the state-of-the-art result. It should be noted that [23] has a top-3 QQmapping accuracy of 91.97%, which is 9% higher than what we used. The QQmapping results have a direct influence on retrieving the related supporting facts.…”
Section: Results Analysis on FVQA
Mentioning confidence: 53%
“…This method is vulnerable to misconceptions caused by synonyms and homographs. A learning-based approach was then developed in [23] for FVQA, which learns a parametric mapping of facts and question-image pairs to an embedding space that permits assessing their compatibility. Features are concatenated over the image-question-answer-facts tuples.…”
Section: Knowledge-based VQA
Mentioning confidence: 99%
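The retrieval mechanism this excerpt describes, learning a parametric mapping of facts and question-image pairs into a shared embedding space and scoring their compatibility there, can be illustrated with a minimal sketch. The sketch below is an assumption-laden illustration rather than the paper's exact architecture: the class name FactRetrievalScorer, the feature dimensions, the two projection heads, and the dot-product score are all hypothetical choices.

    # Minimal sketch (assumed architecture, not the paper's exact model):
    # project fused image-question features and candidate knowledge-base fact
    # embeddings into a joint space, then score compatibility by dot product.
    import torch
    import torch.nn as nn

    class FactRetrievalScorer(nn.Module):
        def __init__(self, iq_dim=2048, fact_dim=300, joint_dim=256):
            super().__init__()
            # Two hypothetical projection heads into the shared embedding space.
            self.iq_proj = nn.Sequential(
                nn.Linear(iq_dim, joint_dim), nn.ReLU(),
                nn.Linear(joint_dim, joint_dim))
            self.fact_proj = nn.Linear(fact_dim, joint_dim)

        def forward(self, iq_feat, fact_feats):
            # iq_feat: (B, iq_dim); fact_feats: (B, K, fact_dim) for K candidates
            q = self.iq_proj(iq_feat)                  # (B, joint_dim)
            f = self.fact_proj(fact_feats)             # (B, K, joint_dim)
            return torch.einsum("bd,bkd->bk", q, f)    # compatibility per fact

    scorer = FactRetrievalScorer()
    iq = torch.randn(4, 2048)         # fused image+question features (hypothetical)
    facts = torch.randn(4, 100, 300)  # 100 candidate fact embeddings per example
    scores = scorer(iq, facts)        # (4, 100)
    top_facts = scores.topk(3, dim=1).indices  # retrieve top-3 supporting facts

Under such a setup, the top-scoring facts are retrieved as supporting evidence, which is consistent with the observation in the earlier excerpt that the accuracy of the question-to-query (QQmapping) stage directly affects fact retrieval.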