Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.64

Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case

Abstract: Semantic embeddings have advanced the state of the art for countless natural language processing tasks, and various extensions to multimodal domains, such as visual-semantic embeddings, have been proposed. While the power of visual-semantic embeddings comes from the distillation and enrichment of information through machine learning, their inner workings are poorly understood and there is a shortage of analysis tools. To address this problem, we generalize the notion of probing tasks to the visual-semantic case…
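The probing setup the abstract refers to can be illustrated with a minimal sketch: a lightweight classifier is trained on frozen embeddings, and its held-out accuracy is read as evidence for whether a given linguistic property is decodable from them. The sketch below is not the paper's code; the embedding matrix, the property labels, and the choice of a logistic-regression probe are assumptions for illustration.

```python
# Minimal probing-task sketch (assumed setup, not the paper's code):
# train a linear probe on frozen visual-semantic embeddings and report
# held-out accuracy on a hypothetical linguistic-property label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def probe(embeddings: np.ndarray, labels: np.ndarray, seed: int = 0) -> float:
    """embeddings: (n_samples, dim) vectors from a frozen encoder (assumed given).
    labels: (n_samples,) property to probe, e.g. a per-caption linguistic label."""
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    clf = LogisticRegression(max_iter=1000)  # simple linear probe
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

# Synthetic stand-in data; real embeddings/labels would come from the model and dataset.
rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 128))
lab = rng.integers(0, 2, size=500)
print(f"probe accuracy: {probe(emb, lab):.3f}")  # ~0.5 on random data
```

On random data the probe stays at chance level; accuracy above chance on real embeddings is what a probing task interprets as the property being encoded.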

Cited by 7 publications (7 citation statements) | References 26 publications

“…For instance, Cao et al [3] proposed various probing tasks to analyze VL transformers, where the authors observed modality importance during inference, and identified attention heads tailored for cross-modal interactions as well as alignments between image and text representations. Additionally, other works have proposed probing tasks to interpret VL transformers for aspects such as visual-semantics [11], verb understanding [14], and other concepts such as shape and size [26]. However, a disadvantage of probing tasks is the amount of work: additional training of the classifiers is often required, and specific task objectives must be defined to capture different embedded concepts.…”
Section: Related Work
confidence: 99%
“…By selecting Mean cross-modal attention from the dropdown menu (see Figure 3), a user can identify attention heads specialized in cross-modal attention. For instance, in Figure 5 the eighth attention head in layer 11 (denoted as (11,8)) has, on average, the highest attention across modalities. Thus, we focus on this specific head and plot its cross-modal (V2L and L2V) attention.…”
Section: Attention Head Summary
confidence: 99%
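A hedged sketch of how such a head summary could be computed from raw attention weights follows; the tensor layout, the modality mask, and the averaging scheme are assumptions for illustration, not the cited tool's actual implementation.

```python
# Sketch of a "mean cross-modal attention" summary: for each (layer, head),
# average the attention mass that language-token queries put on visual-token
# keys (L2V) and vice versa (V2L), then pick the head with the largest value.
import numpy as np

def mean_cross_modal_attention(attn: np.ndarray, is_visual: np.ndarray):
    """attn: (layers, heads, seq, seq) softmaxed attention weights (assumed given).
    is_visual: (seq,) boolean mask, True for visual tokens."""
    vis, lang = is_visual, ~is_visual
    l2v = attn[:, :, lang][:, :, :, vis].mean(axis=(2, 3))  # language -> vision
    v2l = attn[:, :, vis][:, :, :, lang].mean(axis=(2, 3))  # vision -> language
    combined = (l2v + v2l) / 2                              # (layers, heads)
    layer, head = np.unravel_index(combined.argmax(), combined.shape)
    return combined, (layer, head)

# Toy example: 12 layers, 12 heads, 30 tokens, the first 10 of them visual.
rng = np.random.default_rng(0)
attn = rng.random((12, 12, 30, 30))
attn /= attn.sum(-1, keepdims=True)   # normalize each attention row
mask = np.arange(30) < 10
scores, (layer, head) = mean_cross_modal_attention(attn, mask)
print(f"most cross-modal head: layer {layer}, head {head}")
```

The head returned by this kind of summary is then the natural candidate for plotting its V2L and L2V attention in detail, as the quoted passage does for head (11,8).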
“…Large, pretrained models are often analysed via probe tasks or through an investigation of their attention heads (see Belinkov and Glass, 2019, for a survey). For example, Li et al. (2020b) consider VisualBERT's attention heads in a manner similar to Clark et al. (2019), showing that it is able to ground entities and syntactic relations (see also Ilharco et al., 2020; Dahlgren Lindström et al., 2020). Hendricks and Nematzadeh (2021) similarly seek to obtain an in-depth understanding of the representations learned by V&L models, finding that they fail to ground verbs in visual data, compared to other morphosyntactic categories.…”
Section: Related Work
confidence: 99%
“…As outlined in Sect. 2, one of the unwanted outcomes of purely machine-learning-based approaches is bias; debiasing strategies may operate directly on the data or on the language model, whereas other approaches try to find evidence of what, for example, vector representations actually encode (as, for example, in [6,13]). One bias-aware approach is to build hybrid systems that incorporate some structural method or algorithmic approach.…”
Section: Comparison To Other Work
confidence: 99%