Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence 2020
DOI: 10.24963/ijcai.2020/153
Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering

Abstract: Fact-based Visual Question Answering (FVQA) requires external knowledge beyond the visible content to answer questions about an image. This ability is challenging but indispensable for achieving general VQA. One limitation of existing FVQA solutions is that they jointly embed all kinds of information without fine-grained selection, which introduces unexpected noise when reasoning towards the final answer. How to capture the question-oriented and information-complementary evidence remains a key challenge to solve …
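To make the abstract's notion of fine-grained evidence selection concrete, here is a minimal sketch (not Mucko's actual architecture): candidate facts are scored against the question and only the top-k are kept as evidence, rather than jointly embedding everything. The embeddings, dimensions, and function names below are illustrative assumptions.

```python
# Minimal sketch (not the paper's method): question-guided fact selection.
# Facts are ranked by cosine similarity to the question embedding and only
# the top-k are kept as evidence. All inputs here are random placeholders.
import numpy as np

def select_evidence(question_vec, fact_vecs, k=3):
    """Return indices of the k facts most similar to the question."""
    q = question_vec / np.linalg.norm(question_vec)
    f = fact_vecs / np.linalg.norm(fact_vecs, axis=1, keepdims=True)
    scores = f @ q                        # cosine similarity per fact
    return np.argsort(scores)[::-1][:k]  # indices of top-k facts

rng = np.random.default_rng(0)
question = rng.normal(size=128)          # hypothetical question embedding
facts = rng.normal(size=(100, 128))      # 100 candidate fact embeddings
print(select_evidence(question, facts))
```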

Cited by 85 publications (48 citation statements); references 2 publications.
“…OK-VQA [24] provided a new dataset of more than 14,000 questions that require external knowledge to answer. Mucko [25] and [26] utilised the graph structure to capture information from the external fact space and reason about the answer.…”
Section: Related Work
confidence: 99%
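As a rough illustration of "capturing information from the external fact space with a graph structure", one might turn retrieved knowledge triples into an entity graph that a reasoning model can walk. The triples and representation below are invented for illustration, not taken from either cited system.

```python
# Illustrative only: build an entity graph (adjacency lists) from
# retrieved knowledge triples. Triples are hypothetical examples.
from collections import defaultdict

triples = [
    ("cat", "CapableOf", "catch_mice"),
    ("cat", "IsA", "pet"),
    ("dog", "IsA", "pet"),
]

graph = defaultdict(list)             # entity -> [(relation, neighbour)]
for head, rel, tail in triples:
    graph[head].append((rel, tail))
    graph[tail].append((rel, head))   # treat edges as undirected here

print(graph["pet"])                   # [('IsA', 'cat'), ('IsA', 'dog')]
```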
“…Graphs are non-Euclidean structured data that can effectively represent relationships between nodes. Some recent works construct graphs over visual or linguistic elements in V+L tasks, such as VQA [16,27,43], VideoQA [28,30,78], Image Captioning [23,69,75], and Visual Grounding [31,47,68], to reveal relationships between these elements and obtain fine-grained semantic representations. These constructed graphs can be broadly grouped into three types: visual graphs between image objects/regions (e.g., [69]), linguistic graphs between sentence elements/tokens (e.g., [33]), and cross-modal graphs among visual and linguistic elements (e.g., [47]).…”
Section: Graph Construction in V+L Tasks
confidence: 99%
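One common recipe for the first type, a visual graph between image regions, is to treat detected boxes as nodes and give every pair of nodes an edge feature encoding their relative geometry. The box format and feature choice in this sketch are generic assumptions, not any specific paper's recipe.

```python
# Sketch of a visual-graph construction: nodes are detected regions,
# and each pair of regions gets an edge feature encoding relative
# box geometry (normalised offsets and log size ratios).
import numpy as np

def spatial_edge_features(boxes):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; returns (N, N, 4)."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    dx = (cx[None, :] - cx[:, None]) / w[:, None]   # normalised x offset
    dy = (cy[None, :] - cy[:, None]) / h[:, None]   # normalised y offset
    dw = np.log(w[None, :] / w[:, None])            # log width ratio
    dh = np.log(h[None, :] / h[:, None])            # log height ratio
    return np.stack([dx, dy, dw, dh], axis=-1)

boxes = np.array([[0, 0, 10, 10], [5, 5, 20, 25]], dtype=float)
print(spatial_edge_features(boxes).shape)           # (2, 2, 4)
```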
“…In (Wang et al., 2018), FVQA is approached as a parsing and fact-retrieval problem, with facts retrieved directly using lexical-semantic word embeddings. In Out-of-the-box (OOB) reasoning, a Graph Convolutional Network (Kipf and Welling, 2017) is used to reason about the correct entity, while (Zhu et al., 2020) (the current state of the art on the complete-KG FVQA task) added a visual scene graph (Krishna et al., 2016) and a question-based semantic graph alongside the OOB KG reasoning module. In (Ramnath and Hasegawa-Johnson, 2020), FVQA is tackled on incomplete KGs using KG embeddings to represent entities instead of word embeddings, as the latter are shown to be inadequate for this task.…”
Section: KGQA
confidence: 99%
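The GCN referenced here follows Kipf and Welling's propagation rule, H' = σ(D̂^{-1/2} Â D̂^{-1/2} H W) with Â = A + I. Below is a minimal numpy rendering of that rule; the graph size and feature widths are arbitrary illustrations, not the OOB model's configuration.

```python
# Minimal numpy rendering of the GCN propagation rule from
# Kipf & Welling (2017): H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])       # add self-loops
    d = A_hat.sum(axis=1)                # degrees of A_hat
    D_inv_sqrt = np.diag(d ** -0.5)      # D^-1/2
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],                 # 3-node path graph
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = rng.normal(size=(3, 8))              # node features
W = rng.normal(size=(8, 4))              # layer weights
print(gcn_layer(A, H, W).shape)          # (3, 4)
```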