ConceptBert: Concept-Aware Representation for Visual Question Answering

Gardères, François; Ziaeefard, Maryam; Abeloos, Baptiste; Lécué, Freddy

doi:10.18653/v1/2020.findings-emnlp.44

Cited by 91 publications

(54 citation statements)

References 20 publications


“…Mucko (Zhu et al, 2020) goes a step further, reasoning on visual, fact, and semantic graphs separately, and uses cross-modal networks to aggregate them together. ConceptBert (Gardères et al, 2020) combines the BERT-pretrained model (Devlin et al, 2019) with KG. It encodes the KG using a transformer with a BERT embedding query.…”

Section: Related Work (mentioning)

confidence: 99%

Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering

Luo, Zeng, Banerjee et al. 2021

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing


Knowledge-based visual question answering (VQA) requires answering questions with external knowledge in addition to the content of images. One dataset that is mostly used in evaluating knowledge-based VQA is OK-VQA, but it lacks a gold standard knowledge corpus for retrieval. Existing work leverages different knowledge bases (e.g., ConceptNet and Wikipedia) to obtain external knowledge. Because of varying knowledge bases, it is hard to fairly compare models' performance. To address this issue, we collect a natural language knowledge base that can be used for any VQA system. Moreover, we propose a Visual Retriever-Reader pipeline to approach knowledge-based VQA. The visual retriever aims to retrieve relevant knowledge, and the visual reader seeks to predict answers based on given knowledge. We introduce various ways to retrieve knowledge using text and images and two reader styles: classification and extraction. Both the retriever and reader are trained with weak supervision. Our experimental results show that a good retriever can significantly improve the reader's performance on the OK-VQA challenge. The code and corpus are provided in this link.

Question: What sort of vehicle used this item? Answer: fire truck. LXMERT: truck. LXMERT + Caption: fire truck. Ours: fire truck. kn: fire engine, also called fire truck, mobile (nowadays self-propelled) piece of equipment used in firefighting.... Caption: a red fire hydrant sitting on the side of a road.

Question: Where did this sport originate? Answer: australia, hawaii, polynesian. LXMERT: california. LXMERT + Caption: california. Ours: hawaii. kn: surfing was invented in hawaii... Caption: a man riding a wave on a surfboard in the ocean.


“…Recent approaches have shown great potential to incorporate external knowledge for knowledge-based VQA. Several methods explore aggregating the external knowledge either in the form of structured knowledge graphs (Garderes et al, 2020; Narasimhan et al, 2018; Li et al, 2020b; Wang et al, 2017a,b) or unstructured knowledge bases (Marino et al, 2021; Wu et al, 2021; Luo et al, 2021). In these methods, object detectors (Ren et al, 2015) and scene classifiers (He et al, 2016) are used to associate images with external knowledge.…”

Section: Knowledge-based (mentioning)

confidence: 99%

“…Further, external APIs, such as Google (Wu et al, 2021; Luo et al, 2021), Microsoft (Yang et al, 2021), and OCR (Luo et al, 2021; Wu et al, 2021) are used to enrich the associated knowledge. Finally, pre-trained transformer-based language models (Yang et al, 2021) or multimodal models (Wu et al, 2021; Luo et al, 2021; Wu et al, 2021; Garderes et al, 2020; Marino et al, 2021) are leveraged as implicit knowledge bases for answer predictions.…”

Section: Knowledge-based (mentioning)

confidence: 99%

“…The key challenge here is to accurately link image content to abstract external knowledge. There have been a number of recent developments demonstrating the feasibility of incorporating external knowledge into Question Answering models (Wang et al, 2017b; Li et al, 2020b; Marino et al, 2021; Wu et al, 2021; Garderes et al, 2020). Existing methods first retrieve external knowledge from multiple external knowledge resources, such as DBPedia (Auer et al, 2007) and ConceptNet (Liu and Singh, 2004) before jointly reasoning over the retrieved knowledge and image content to predict an answer.…”

Section: Introduction (mentioning)

confidence: 99%


KAT: A Knowledge Augmented Transformer for Vision-and-Language

Gui, Wang, Huang et al. 2021

Preprint


The primary focus of recent work with large-scale transformers has been on optimizing the amount of information packed into the model's parameters. In this work, we ask a different question: Can multimodal transformers leverage explicit knowledge in their reasoning? Existing, primarily unimodal, methods have explored approaches under the paradigm of knowledge retrieval followed by answer prediction, but leave open questions about the quality and relevance of the retrieved knowledge used, and how the reasoning processes over implicit and explicit knowledge should be integrated. To address these challenges, we propose a novel model, Knowledge Augmented Transformer (KAT), which achieves a strong state-of-the-art result (+6 points absolute) on the open-domain multimodal task of OK-VQA. Our approach integrates implicit and explicit knowledge in an end-to-end encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation. An additional benefit of explicit knowledge integration is seen in improved interpretability of model predictions in our analysis.


“…Incorporating external knowledge into VQA models combines visual observations with external knowledge (Garderes et al, 2020). Organizing external knowledge and storing it in a structured database, such as a Knowledge Base (KB), has become an important resource for representing general knowledge.…”

Section: Introduction (mentioning)

confidence: 99%

Towards Knowledge-Augmented Visual Question Answering

Ziaeefard, Lécué 2020

Proceedings of the 28th International Conference on Computational Linguistics

Self Cite


Visual Question Answering (VQA) remains algorithmically challenging while it is effortless for humans. Humans combine visual observations with general and commonsense knowledge to answer a question about a given image. In this paper, we address the problem of incorporating general knowledge into VQA models while leveraging the visual information. We propose a model that captures the interactions between objects in a visual scene and entities in an external knowledge source. Our model is a graph-based approach that combines scene graphs with concept graphs, which learns a question-adaptive graph representation of related knowledge instances. We use Graph Attention Networks to set higher importance to key knowledge instances that are mostly relevant to each question. We exploit ConceptNet as the source of general knowledge and evaluate the performance of our model on the challenging OK-VQA dataset. Our code will be available at https:

