Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering

Narasimhan, Medhini; Schwing, Alexander G.

doi:10.1007/978-3-030-01237-3_28

Cited by 97 publications

(78 citation statements)

References 57 publications


“…In recent years various machine learning techniques were developed to tackle cognitive-like multimodal tasks, which involve both vision and language processing. Image captioning [36,56,24,50,7,4,13] was an instrumental language+vision task, followed by visual question answering [33,42,25,34,41,5,15,59,23,3,9,14,46,55,42,54,39,40,43] and visual question generation [41,38,22,49,28,6].…”

Section: Related Work (mentioning)

confidence: 99%

Factor Graph Attention

Schwartz, Hazan, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Self Cite

110


Dialog is an effective way to exchange information, but subtle details and nuances are extremely important. While significant progress has paved a path to address visual dialog with algorithms, details and nuances remain a challenge. Attention mechanisms have demonstrated compelling results to extract details in visual question answering and also provide a convincing framework for visual dialog due to their interpretability and effectiveness. However, the many data utilities that accompany visual dialog challenge existing attention techniques. We address this issue and develop a general attention mechanism for visual dialog which operates on any number of data utilities. To this end, we design a factor graph based attention mechanism which combines any number of utility representations. We illustrate the applicability of the proposed approach on the challenging and recently introduced VisDial datasets, outperforming recent state-of-the-art methods by 1.1% for VisDial0.9 and by 2% for VisDial1.0 on MRR. Our ensemble model improved the MRR score on VisDial1.0 by more than 6%.
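The abstract above reports its gains in MRR (mean reciprocal rank). As a reminder of how that metric is computed, here is a minimal sketch. It assumes each dialog question yields the 1-based rank that the model assigned to the ground-truth answer; the function name and the toy ranks are hypothetical and are not taken from the paper.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries.

    `ranks` holds the 1-based position of the correct answer
    in each ranked candidate list (illustrative input format).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy example: correct answers ranked 1st, 2nd and 4th.
toy_ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(toy_ranks)
print(f"MRR = {mrr:.4f}")
```

A higher MRR means the correct answer tends to appear closer to the top of the ranked list; a value of 1.0 would mean it is always ranked first.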


“…For instance, in computer vision, a tremendous amount of recent work has focused on image captioning [68,30,11,16,75,45,77,31,69,4,15,10], visual question generation [36,48,47,28], visual question answering [5,19,59,54,44,73,74,76,57,58,49,50], and very recently visual dialog [13,14,27,46]. While those meticulously engineered algorithms have shown promising results in their specific domain, little is known about the end-to-end performance of an entire system.…”

Section: Introduction (mentioning)

confidence: 99%

A Simple Baseline for Audio-Visual Scene-Aware Dialog

Schwing, Hazan 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Self Cite


The recently proposed audio-visual scene-aware dialog task paves the way to a more data-driven way of learning virtual assistants, smart speakers and car navigation systems. However, very little is known to date about how to effectively extract meaningful information from a plethora of sensors that pound the computational engine of those devices. Therefore, in this paper, we provide and carefully analyze a simple baseline for audio-visual scene-aware dialog which is trained end-to-end. Our method differentiates in a data-driven manner useful signals from distracting ones using an attention mechanism. We evaluate the proposed approach on the recently introduced and challenging audio-visual scene-aware dataset, and demonstrate the key features that permit to outperform the current state-of-the-art by more than 20% on CIDEr.

Recent work on audio-visual scene aware dialog [2,25] partly addresses this shortcoming and proposes a novel [...]

[Figure: example question-answer pairs with MultiModal-Attention panels, e.g. "Question: what color is the rag? Answer: it appears to be white."]


“…Our method with finetuned QANet achieves the highest top-1 accuracy, which is 0.7% higher than the state-of-the-art result. It should be noted that [23] has the top-3-QQmapping accuracy of 91.97%, which is 9% higher than what we used. The QQmapping results have a direct influence on retrieving the related supporting facts.…”

Section: Results Analysis on FVQA (mentioning)

confidence: 53%

“…This method is vulnerable to misconceptions caused by synonyms and homographs. A learning based approach was then developed in [23] for FVQA, which learns a parametric mapping of facts and question-image pairs to an embedding space that permits to assess their compatibility. Features are concatenated over the image-question-answer-facts tuples.…”

Section: Knowledge-based VQA (mentioning)

confidence: 99%

Visual Question Answering as Reading Comprehension

Wang, Shen, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)


Visual question answering (VQA) demands simultaneous comprehension of both the image visual content and natural language questions. In some cases, the reasoning needs the help of common sense or general knowledge which usually appear in the form of text. Current methods jointly embed both the visual information and the textual feature into the same space. However, how to model the complex interactions between the two different modalities is not an easy task. In contrast to struggling on multimodal feature fusion, in this paper, we propose to unify all the input information by natural language so as to convert VQA into a machine reading comprehension problem. With this transformation, our method not only can tackle VQA datasets that focus on observation based questions, but can also be naturally extended to handle knowledge-based VQA which requires to explore large-scale external knowledge base. It is a step towards being able to exploit large volumes of text and natural language processing techniques to address VQA problem. Two types of models are proposed to deal with open-ended VQA and multiple-choice VQA respectively. We evaluate our models on three VQA benchmarks. The comparable performance with the state-of-the-art demonstrates the effectiveness of the proposed method.
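The citation statement quoted for this paper characterizes [23] as learning a mapping of facts and question-image pairs into an embedding space in which their compatibility can be assessed. The sketch below illustrates only the retrieval step of such a scheme, under loud assumptions: the embeddings are hand-written toy vectors rather than learned ones, cosine similarity stands in for whatever compatibility score the original work uses, and every name (`cosine`, `rank_facts`, the fact strings) is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as incompatible with everything
    return dot / (norm_u * norm_v)


def rank_facts(query_vec, fact_vecs, top_k=3):
    """Return the top_k fact keys, most compatible with the query first.

    query_vec: embedding of a question-image pair (toy values here).
    fact_vecs: dict mapping a fact string to its embedding.
    """
    scored = [(cosine(query_vec, vec), fact) for fact, vec in fact_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [fact for _, fact in scored[:top_k]]


# Toy embeddings, invented purely for illustration.
query = [1.0, 0.0]
facts = {
    "(umbrella, UsedFor, rain protection)": [0.9, 0.1],
    "(dog, IsA, animal)": [0.0, 1.0],
    "(car, HasA, wheel)": [-1.0, 0.0],
}

print(rank_facts(query, facts, top_k=1))
```

In a learned system the two embedding functions would be trained so that a question-image pair lands near its supporting fact; here the ranking simply reflects the geometry of the toy vectors.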
