Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering

Luo, Man; Zeng, Yankai; Banerjee, Pratyay; Baral, Chitta

doi:10.18653/v1/2021.emnlp-main.517

Cited by 31 publications

(23 citation statements)

References 36 publications


“…ViLBERT (Lu et al, 2019) and LXMERT (Tan and Bansal, 2019) propose a two-stream architecture that processes images and text independently and fuses them with a third transformer at a later stage. While these models have been shown to store in-depth cross-modal knowledge and have achieved impressive performance on knowledge-based VQA (Marino et al, 2021; Wu et al, 2022; Luo et al, 2021), this type of implicitly learned knowledge is not sufficient to answer many knowledge-based questions (Marino et al, 2021). Another line of work on multimodal transformers, such as CLIP (Radford et al, 2021) or ALIGN (Jia et al, 2021), aligns visual and language representations by contrastive learning.…”

Section: Related Work (mentioning)

confidence: 99%

“…Recent approaches have shown great potential for incorporating external knowledge into knowledge-based VQA. Several methods explore aggregating external knowledge in the form of structured knowledge graphs (Garderes et al, 2020; Narasimhan et al, 2018; Li et al, 2020b; Wang et al, 2017a,b), unstructured knowledge bases (Marino et al, 2021; Wu et al, 2022; Luo et al, 2021), or neural-symbolic inference-based knowledge (Chen et al, 2020; West et al, 2021). In these methods, object detectors (Ren et al, 2015) and scene classifiers (He et al, 2016) are used to associate images with external knowledge.…”

Section: Related Work (mentioning)

confidence: 99%

“…In these methods, object detectors (Ren et al, 2015) and scene classifiers (He et al, 2016) are used to associate images with external knowledge. Further, external APIs, such as Google (Wu et al, 2022; Luo et al, 2021), Microsoft (Chen et al, 2021a; Yang et al, 2022) and OCR (Luo et al, 2021; Wu et al, 2022), are used to enrich the associated knowledge. Finally, pre-trained transformer-based language models (Chen et al, 2021a; Yang et al, 2022) or multimodal models (Wu et al, 2022; Luo et al, 2021; Wu et al, 2022; Garderes et al, 2020; Marino et al, 2021) are leveraged as implicit knowledge bases for answer prediction.…”

Section: Related Work (mentioning)

confidence: 99%

“…While part of our approach is similar to PICa (Yang et al, 2022), which treats GPT-3 as an implicit knowledge base, our model goes one step further by showing how explicit and implicit knowledge can be integrated during knowledge reasoning. Another similar work, Vis-DPR (Luo et al, 2021), collects a knowledge corpus from the training set via Google Search, which makes the corpus specific to a certain dataset. Our proposed model is more generic: it collects entities from Wikidata and is not limited to the training set.…”

Section: Related Work (mentioning)

confidence: 99%


KAT: A Knowledge Augmented Transformer for Vision-and-Language

Gui, Wang, Huang et al. 2022

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua


The primary focus of recent work with large-scale transformers has been on optimizing the amount of information packed into the model's parameters. In this work, we ask a complementary question: Can multimodal transformers leverage explicit knowledge in their reasoning? Existing, primarily unimodal, methods have explored approaches under the paradigm of knowledge retrieval followed by answer prediction, but leave open questions about the quality and relevance of the retrieved knowledge used, and how the reasoning processes over implicit and explicit knowledge should be integrated. To address these challenges, we propose a Knowledge Augmented Transformer (KAT), which achieves a strong state-of-the-art result (+6% absolute) on the open-domain multimodal task of OK-VQA. Our approach integrates implicit and explicit knowledge in an encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation. Additionally, explicit knowledge integration improves interpretability of model predictions in our analysis. Code and pre-trained models are released at https://github.com/guilk/KAT.


“…Result: We evaluate three retrievers on the OK-VQA dataset and use the knowledge base (with 112,724 pieces of knowledge) created in (Luo et al, 2021b) as the corpus. We retrieve 1/5/10/20/50/80/100 pieces of knowledge for each question.…”

Section: How To Retrieve Information For (mentioning)

confidence: 99%
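The sweep over 1/5/10/20/50/80/100 retrieved pieces described in the statement above can be scored in several ways; the excerpt does not say which metric was used. The sketch below shows one common choice, a hit rate at k (the fraction of questions for which at least one relevant piece appears among the top k). The function name, the dictionary layout, and the notion of a "relevant" piece are all assumptions made for illustration, not the cited paper's evaluation code.

```python
from typing import Dict, Iterable, List, Set


def hit_rate_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    ks: Iterable[int],
) -> Dict[int, float]:
    """Fraction of questions with at least one relevant item in the top k.

    ranked:   question id -> knowledge ids, best first (hypothetical layout)
    relevant: question id -> ids judged relevant (hypothetical layout)
    ks:       cutoffs to evaluate, e.g. [1, 5, 10, 20, 50, 80, 100]
    """
    results: Dict[int, float] = {}
    n_questions = len(ranked)
    for k in ks:
        hits = 0
        for qid, ranking in ranked.items():
            top_k = ranking[:k]
            gold = relevant.get(qid, set())
            # Count the question once if any top-k item is relevant.
            if any(item in gold for item in top_k):
                hits += 1
        results[k] = hits / n_questions if n_questions else 0.0
    return results


if __name__ == "__main__":
    # Toy data, invented for the example.
    toy_ranked = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
    toy_relevant = {"q1": {"b"}, "q2": {"q"}}
    print(hit_rate_at_k(toy_ranked, toy_relevant, [1, 2, 3]))
```

Because the score can only stay flat or rise as k grows, a curve over these cutoffs shows how quickly each retriever surfaces useful knowledge.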

Neural Retriever and Go Beyond: A Thesis Proposal

Luo 2022

Preprint

Self Cite


An Information Retriever (IR) aims to find the documents (e.g. snippets, passages, and articles) relevant to a given query at large scale. IR plays an important role in many tasks, such as open-domain question answering and dialogue systems, where external knowledge is needed. In the past, search algorithms based on term matching were widely used. Recently, neural algorithms (termed neural retrievers), which can mitigate the limitations of traditional methods, have gained more attention. Despite the success achieved by neural retrievers, they still face many challenges, e.g. suffering from a small amount of training data and failing to answer simple entity-centric questions. Furthermore, most existing neural retrievers are developed for pure-text queries. This prevents them from handling multi-modality queries (i.e. queries composed of a textual description and images). This proposal has two goals. First, we introduce methods to address the above-mentioned issues of neural retrievers from three angles: new model architectures, IR-oriented pretraining tasks, and generating large-scale training data. Second, we identify future research directions and propose potential corresponding solutions.
