Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.469

What Does BERT with Vision Look At?

Abstract: Pre-trained visually grounded language models such as ViLBERT, LXMERT, and UNITER have achieved significant performance improvement on vision-and-language tasks but what they learn during pre-training remains unclear. In this work, we demonstrate that certain attention heads of a visually grounded language model actively ground elements of language to image regions. Specifically, some heads can map entities to image regions, performing the task known as entity grounding. Some heads can even detect the syntacti…
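
A minimal sketch can make the entity-grounding probe described in the abstract concrete. The snippet below is illustrative only and does not reproduce the authors' code: it assumes a VisualBERT-style sequence in which image-region features are appended after the text tokens, substitutes random tensors for a real model's per-layer attention maps, and simply reads off, for one chosen head, the image region each entity token attends to most.

```python
# Illustrative only: reading entity -> region grounding off one attention head.
# A real model (e.g. one run with output_attentions=True) would supply
# `attentions`; random tensors stand in for them here.
import torch

num_text_tokens, num_regions = 12, 36            # assumed layout: text tokens first, regions appended
seq_len = num_text_tokens + num_regions
num_layers, num_heads = 12, 12

attentions = [torch.rand(1, num_heads, seq_len, seq_len).softmax(dim=-1)
              for _ in range(num_layers)]         # [batch, heads, query, key] per layer

layer, head = 5, 3                               # the head being inspected (arbitrary choice)
entity_positions = [2, 7]                        # token indices of entity words in the caption

att = attentions[layer][0, head]                 # [seq_len, seq_len]
for pos in entity_positions:
    region_att = att[pos, num_text_tokens:]      # attention mass on the image regions
    best = region_att.argmax().item()
    print(f"entity token {pos} -> region {best} (weight {region_att[best]:.3f})")
```
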

Cited by 151 publications (177 citation statements) · References 28 publications
“…Devlin et al. (2019) first proposed to pre-train a large Transformer architecture, called BERT, to learn representations of natural language using large-scale unlabeled data in a self-supervised fashion. Later, BERT's task-independent pre-training approach was rigorously studied (Devlin et al., 2019; Solaiman et al., 2019; Feng et al., 2020; Li et al., 2020). While BERT-like models have shown effectiveness in learning contextualized representations, they are not very useful for generation tasks.…”
Section: Related Work
confidence: 99%
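
The self-supervised pre-training objective mentioned in this statement (BERT's masked language modelling) can be sketched in a few lines with the Hugging Face transformers library. This is an illustrative toy step, not the cited works' large-scale training pipeline; it masks a single, hand-picked token rather than the usual random 15% sample.

```python
# Toy masked-language-modelling step: hide one token and ask BERT to recover it.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("a man is riding a horse on the beach", return_tensors="pt")
labels = inputs["input_ids"].clone()

masked_pos = 6                                   # position of "horse" after tokenization
keep = torch.arange(labels.size(1)) == masked_pos
labels[0, ~keep] = -100                          # only the masked position contributes to the loss
inputs["input_ids"][0, masked_pos] = tokenizer.mask_token_id

outputs = model(**inputs, labels=labels)
predicted = outputs.logits[0, masked_pos].argmax().item()
print(float(outputs.loss), tokenizer.decode([predicted]))   # low loss, ideally "horse"
```
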
“…paraphrases) and align visual with linguistic elements. Li et al. (2020) hinted at the importance of an attention-based vision-and-language model's ability to map entity words to corresponding image regions. Following this direction, and to improve a model's reasoning abilities, we propose to further fine-tune a pre-trained model with the aim of learning visually grounded paraphrases (VGPs) (Chu et al., 2018; Otani et al., 2020).…”
Section: VGP Fine-tuning
confidence: 99%
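
As a rough illustration of the fine-tuning direction described in the statement above, the sketch below scores whether two phrase representations are visually grounded paraphrases of the same image region. The head, its dimensions, and the random stand-in inputs are all hypothetical and are not the cited papers' architecture; in practice the phrase and region vectors would come from a pre-trained vision-and-language encoder.

```python
# Hypothetical VGP-style scoring head (not the cited works' architecture).
import torch
import torch.nn as nn

class VGPHead(nn.Module):
    """Scores whether two phrases paraphrase each other for a given image region."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, phrase_a, phrase_b, region):
        return self.scorer(torch.cat([phrase_a, phrase_b, region], dim=-1))

# Random stand-ins for encoder outputs; real inputs come from the pre-trained model.
head = VGPHead()
a, b, r = torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 768)
logits = head(a, b, r).squeeze(-1)                           # [batch]
loss = nn.functional.binary_cross_entropy_with_logits(logits, torch.ones(4))
print(float(loss))
```
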
“…On the other hand, the internal behaviors of vision-and-language models have attracted less interest from the research community. Li et al. (2020) have shown that some attention heads in vision-and-language models are able to map entities to image regions, while others even detect syntactic relations between non-entity words and image regions. Nevertheless, no initiative has been taken towards directly supervising the attention modules.…”
Section: Introduction
confidence: 99%
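
The idea of supervising the attention modules directly, raised in the statement above, could in principle be expressed as an auxiliary loss on a head's attention distribution. The sketch below is a hypothetical illustration, not the cited papers' method: it treats one head's pre-softmax attention scores from an entity token over the image regions as a classification over regions and pushes them toward an annotated gold region.

```python
# Hypothetical attention-supervision term (illustrative, not the cited method):
# nudge one head's attention from an entity token toward the annotated region.
import torch
import torch.nn.functional as F

num_regions = 36
attn_logits = torch.randn(1, num_regions, requires_grad=True)   # pre-softmax attention scores
gold_region = torch.tensor([17])                                 # annotated region index

supervision_loss = F.cross_entropy(attn_logits, gold_region)     # = -log softmax(attn)[gold]
supervision_loss.backward()          # would be added to the task loss during training
print(float(supervision_loss))
```
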
“…Current state-of-the-art image captioning models use a pre-trained object detector to generate features and spatial information for the objects present in an image (e.g., as in Oscar [12]). In particular, [12] utilizes BERT-like objectives to learn cross-modal representations across different vision-language tasks (similar ideas form the basis of recent pre-trained multi-modal models, e.g., VisualBERT [11]). Prior captioning approaches have used attention mechanisms and their variants to capture spatial relationships between objects [6] when generating captions.…”
Section: Textual Description of Images: Object Detection and Image Captioning
confidence: 99%
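
The input construction this statement refers to, detector region features plus spatial information appended to the caption tokens, can be sketched as follows. The shapes and the single projection layer are assumptions for illustration; Oscar and VisualBERT each have their own exact embedding schemes.

```python
# Assumed-shape sketch: append projected detector features (plus box geometry)
# to the caption's token embeddings so one transformer can attend over both.
import torch
import torch.nn as nn

hidden, vis_dim = 768, 2048
num_tokens, num_regions = 12, 36

token_embeddings = torch.randn(1, num_tokens, hidden)       # from the text embedding layer
region_features = torch.randn(1, num_regions, vis_dim)      # e.g. Faster R-CNN pooled features
region_boxes = torch.rand(1, num_regions, 4)                # normalized (x1, y1, x2, y2)

visual_proj = nn.Linear(vis_dim + 4, hidden)                 # project features + geometry
visual_embeddings = visual_proj(torch.cat([region_features, region_boxes], dim=-1))

joint_input = torch.cat([token_embeddings, visual_embeddings], dim=1)
print(joint_input.shape)                                     # torch.Size([1, 48, 768])
```
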