Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering

Zhu, Zihao; Yu, Jing; Wang, Yujing; Sun, Yajing; Hu, Yue; Wu, Qi

doi:10.24963/ijcai.2020/153

Cited by 85 publications

(48 citation statements)

References 2 publications


“…OK‐VQA [24] provided a new data set including more than 14,000 questions that require external knowledge to answer. Mucko [25] and [26] utilised the graph structure to capture information from the external fact space for reasoning the answer.…”

Section: Related Work (mentioning)

confidence: 99%

Visual question answering with gated relation‐aware auxiliary

Shao, Xiang. 2022

IET Image Processing


The great advances in computer vision and natural language processing make significant progress in visual question answering. In the visual question answering task, the visual representation is essential for understanding the image content. However, traditional methods rarely exploit the context information of the visual feature related to the question and the relation‐aware information to capture valuable visual representation. Therefore, a gated relation‐aware model is proposed to capture the enhanced visual representation for desiring answer prediction. The gated relation‐aware module can learn relation‐aware information between the visual feature and the context, and a certain object of an image, respectively. In addition, the proposed module can filter out the unnecessary relation‐aware information through the gate guided by the question semantic representation. The results of the conducted experiments show that the gated relation‐aware module makes a significant improvement on all answer categories.


“…Graphs are non-Euclidean structured data, which can effectively represent relationships between nodes. Some recent works construct graphs for visual or linguistic elements in V+L tasks, such as VQA [16,27,43], VideoQA [28,30,78], Image Captioning [23,69,75], and Visual Grounding [31,47,68], to reveal relationships between these elements and obtain fine-grained semantic representations. These constructed graphs can be broadly grouped into three types: visual graphs between image objects/regions (e.g., [69]), linguistic graphs between sentence elements/tokens (e.g., [33]), and crossmodal graphs among visual and linguistic elements (e.g., [47]).…”

Section: Graph Construction in V+L Tasks (mentioning)

confidence: 99%

X-GGM: Graph Generative Modeling for Out-of-distribution Generalization in Visual Question Answering

Jiang, Liu, et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia


Encouraging progress has been made towards Visual Question Answering (VQA) in recent years, but it is still challenging to enable VQA models to adaptively generalize to out-of-distribution (OOD) samples. Intuitively, recompositions of existing visual concepts (i.e., attributes and objects) can generate unseen compositions in the training set, which will promote VQA models to generalize to OOD samples. In this paper, we formulate OOD generalization in VQA as a compositional generalization problem and propose a graph generative modeling-based training scheme (X-GGM) to handle the problem implicitly. X-GGM leverages graph generative modeling to iteratively generate a relation matrix and node representations for the predefined graph that utilizes attribute-object pairs as nodes. Furthermore, to alleviate the unstable training issue in graph generative modeling, we propose a gradient distribution consistency loss to constrain the data distribution with adversarial perturbations and the generated distribution. The baseline VQA model (LXMERT) trained with the X-GGM scheme achieves state-of-the-art OOD performance on two standard VQA OOD benchmarks, i.e., VQA-CP v2 and GQA-OOD. Extensive ablation studies demonstrate the effectiveness of X-GGM components.

CCS Concepts: Computing methodologies → Computer vision tasks; Information systems → Question answering.


“…In (Wang et al, 2018), FVQA is approached as a parsing and fact-retrieval problem, while directly retrieves facts using lexical-semantic word embeddings. In Out-of-the-box (OOB) reasoning, a Graph Convolutional Network (Kipf and Welling, 2017) is used to reason about the correct entity, while (Zhu et al, 2020) (the current State-of-the-Art in the complete-KG FVQA task) added a visual scene-graph (Krishna et al, 2016) and a semantic graph based on the question alongside the (OOB) KG reasoning module. In (Ramnath and Hasegawa-Johnson, 2020), FVQA is tackled on incomplete KGs using KG embeddings to represent entities instead of word-embeddings, as the latter are shown to be inadequate for this task.…”

Section: KGQA (mentioning)

confidence: 99%

Worldly Wise (WoW) - Cross-Lingual Knowledge Fusion for Fact-based Visual Spoken-Question Answering

Ramnath, Sarı, Hasegawa‐Johnson, et al. 2021

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua


Although Question-Answering has long been of research interest, its accessibility to users through a speech interface and its support to multiple languages have not been addressed in prior studies. Towards these ends, we present a new task and a synthetically-generated dataset to do Fact-based Visual Spoken-Question Answering (FVSQA). FVSQA is based on the FVQA dataset, which requires a system to retrieve an entity from Knowledge Graphs (KGs) to answer a question about an image. In FVSQA, the question is spoken rather than typed. Three sub-tasks are proposed: (1) speech-to-text based, (2) end-to-end, without speech-to-text as an intermediate component, and (3) cross-lingual, in which the question is spoken in a language different from that in which the KG is recorded. The end-to-end and cross-lingual tasks are the first to require world knowledge from a multi-relational KG as a differentiable layer in an end-to-end spoken language understanding task, hence the proposed reference implementation is called Worldly-Wise (WoW). WoW is shown to perform end-to-end cross-lingual FVSQA at same levels of accuracy across 3 languages - English, Hindi, and Turkish.
