2021
DOI: 10.48550/arxiv.2104.10283
Preprint

GraphVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering

Abstract: Images are more than a collection of objects or attributes; they represent a web of relationships among interconnected objects. The scene graph has emerged as a new modality: a structured graphical representation of images. A scene graph encodes objects as nodes connected via pairwise relations as edges. To support question answering on scene graphs, we propose GraphVQA, a language-guided graph neural network framework that translates and executes a natural language question as multiple iterations of message passing…
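The abstract's central idea, translating a question into instructions that drive repeated rounds of message passing over scene-graph nodes, can be made concrete with a short sketch. The module below is a minimal, hypothetical rendering of that pattern rather than the paper's actual architecture; the class name LanguageGuidedMessagePassing and every dimension and hyperparameter in it are invented for illustration.

```python
# Hypothetical sketch of language-guided message passing on a scene graph.
# Not the GraphVQA implementation; names and dimensions are illustrative.
import torch
import torch.nn as nn

class LanguageGuidedMessagePassing(nn.Module):
    def __init__(self, node_dim, edge_dim, question_dim, num_iterations=4):
        super().__init__()
        self.num_iterations = num_iterations
        # One "instruction" projection per iteration, derived from the question.
        self.instruction_proj = nn.ModuleList(
            nn.Linear(question_dim, node_dim) for _ in range(num_iterations)
        )
        self.message_fn = nn.Linear(2 * node_dim + edge_dim, node_dim)
        self.update_fn = nn.GRUCell(node_dim, node_dim)

    def forward(self, node_feats, edge_index, edge_feats, question_vec):
        # node_feats: (N, node_dim); edge_index: (2, E) source/target indices;
        # edge_feats: (E, edge_dim); question_vec: (question_dim,)
        src, dst = edge_index
        h = node_feats
        for t in range(self.num_iterations):
            instr = self.instruction_proj[t](question_vec)  # (node_dim,)
            # Compute a message along every edge from its endpoints and relation.
            msg_in = torch.cat([h[src], h[dst], edge_feats], dim=-1)
            # Gate messages with the current instruction; this gate is what
            # makes the propagation "language-guided".
            msg = torch.tanh(self.message_fn(msg_in)) * torch.sigmoid(instr)
            # Sum incoming messages at each target node, then update its state.
            agg = torch.zeros_like(h).index_add_(0, dst, msg)
            h = self.update_fn(agg, h)
        return h  # question-conditioned node representations
```

Answering then amounts to pooling the final node states and classifying over candidate answers; the per-iteration instruction gate is what lets different rounds of propagation attend to different parts of the question.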

Cited by 4 publications (5 citation statements) · References 18 publications

Citation Statements
“…al. [50] created a new benchmark, OK-VQA, which contains open-domain questions for which the information provided by the image and the question alone is not enough to produce an answer. Many authors have used a graph-based approach [17,41,42,43] to integrate an external KB into the VQA model, where the important objects in the image are represented as nodes in a graph and the relationships between these nodes as edges. [38] proposed a novel methodology that takes the question's textual keywords and the important visual items from the image, using these to extract knowledge from ConceptNet in the form of a knowledge graph.…”
Section: Knowledge-Incorporated VQA Models (mentioning)
confidence: 99%
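To make the cited graph-based KB integration concrete (objects as nodes, visual relations and knowledge-base facts as typed edges), here is a small hypothetical sketch. build_graph and conceptnet_facts are stand-ins invented for illustration; a real system would query ConceptNet rather than hard-code the facts.

```python
# Hypothetical sketch: fuse detected objects and ConceptNet-style facts into
# one graph. Not any cited paper's API; all names are illustrative.
from collections import defaultdict

def build_graph(detections, relations, conceptnet_facts):
    """detections: object labels found in the image;
    relations: (subject, predicate, object) triples between detections;
    conceptnet_facts: {label: [(predicate, concept), ...]} external knowledge."""
    nodes = set(detections)
    edges = defaultdict(list)
    # Visual edges: pairwise relations between detected objects.
    for subj, pred, obj in relations:
        edges[(subj, obj)].append(("visual", pred))
    # Knowledge edges: attach KB concepts to detected objects as new nodes.
    for label in detections:
        for pred, concept in conceptnet_facts.get(label, []):
            nodes.add(concept)
            edges[(label, concept)].append(("knowledge", pred))
    return nodes, dict(edges)

# Example: a dog catching a frisbee, augmented with two hard-coded KB facts.
nodes, edges = build_graph(
    ["dog", "frisbee"],
    [("dog", "catching", "frisbee")],
    {"dog": [("IsA", "animal"), ("CapableOf", "run")]},
)
```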
“…To answer VQA questions, models typically leverage variants of attention to obtain a representation of the image that is relevant to the question (Andreas et al., 2016; Yang et al., 2015; Xu and Saenko, 2016; Fukui et al., 2016; Lu et al., 2016). A plethora of works (Liang et al., 2021; Hudson and Manning, 2018; Yi et al., 2018b; Xiong et al., 2016; Kim et al., 2018; Teney et al., 2017a) have attempted to enhance the reasoning capability of VQA models, with Teney et al. (2017a) proposing to improve VQA using structured representations of the scene contents and questions. They developed a deep neural network that leverages the structure in these representations and builds graphs over scene objects and question words.…”
Section: Multimodal Question Answering (mentioning)
confidence: 99%
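The attention pattern this snippet describes (using the question to weight image regions and pool them into a question-relevant representation) reduces to a few lines. The sketch below is a generic illustration under assumed tensor shapes, not a reproduction of any cited model; QuestionGuidedAttention and its dimensions are invented.

```python
# Generic sketch of question-guided attention over image regions.
# Names and dimensions are illustrative, not from any cited paper.
import torch
import torch.nn as nn

class QuestionGuidedAttention(nn.Module):
    def __init__(self, region_dim, question_dim, hidden_dim=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(region_dim + question_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, regions, question):
        # regions: (K, region_dim) features for K image regions;
        # question: (question_dim,) encoded question vector.
        q = question.expand(regions.size(0), -1)
        # Score each region against the question, normalize, and pool.
        weights = torch.softmax(self.score(torch.cat([regions, q], dim=-1)), dim=0)
        return (weights * regions).sum(dim=0)  # question-relevant image vector
```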
“…Summaira et al. [17] conducted an extensive study of multimodal deep learning, incorporating modalities such as image, video, text, audio, body gestures, facial expressions, and physiological signals. Among those, there have been studies applying multimodal deep learning to combined image and text data, including visually grounded referring-expression understanding and phrase localization [18,19], image and video captioning [20-22], text-to-image generation [23-25], and visual question answering (VQA) [26-30]. Antol et al. [26] developed a multimodal deep learning model for the visual question answering (VQA) task.…”
Section: Application of Multimodal Techniques in Other Fields (mentioning)
confidence: 99%
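As a final illustration, the basic two-branch VQA formulation mentioned last (an image encoder and a question encoder fused into an answer classifier) can be sketched as below. This is a rough sketch in the spirit of that line of work, not the architecture of Antol et al.; SimpleVQA and all layer sizes are assumptions.

```python
# Rough sketch of a two-branch VQA model: pooled CNN image features fused
# with an LSTM question encoding, classified over candidate answers.
# Illustrative only; not the model of Antol et al.
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, img_dim=2048, vocab_size=10000, embed_dim=300,
                 hidden_dim=1024, num_answers=3000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feats, question_tokens):
        # img_feats: (B, img_dim) pooled image features;
        # question_tokens: (B, T) token ids of the question.
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        # Elementwise-product fusion of the two modalities.
        fused = torch.tanh(self.img_proj(img_feats)) * h_n[-1]
        return self.classifier(fused)  # logits over candidate answers
```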