Visual Entailment Task for Visually-Grounded Language Learning

Xie, Ning; Lai, Farley; Kadav, Asim

doi:10.48550/arxiv.1811.10582

Cited by 13 publications

(16 citation statements)

References 18 publications

“…Downstream Tasks. We conduct a comprehensive evaluation of our models over a wide range of downstream tasks, including VQAv2 [11], GQA [20], Visual Entailment (SNLI-VE) [60], NLVR2 [52], and Image-Text Retrieval.…”

Section: Methods (mentioning)

confidence: 99%

MLP Architectures for Vision-and-Language Modeling: An Empirical Study

Nie, Fu, Gan et al. 2021

Preprint

We initiate the first empirical study on the use of MLP architectures for vision-and-language (VL) fusion. Through extensive experiments on 5 VL tasks and 5 robust VQA benchmarks, we find that: (i) Without pre-training, using MLPs for multimodal fusion has a noticeable performance gap compared to transformers; (ii) However, VL pre-training can help close the performance gap; (iii) Instead of heavy multi-head attention, adding tiny one-head attention to MLPs is sufficient to achieve comparable performance to transformers. Moreover, we also find that the performance gap between MLPs and transformers is not widened when being evaluated on the "harder" robust VQA benchmarks, suggesting using MLPs for VL fusion can generalize roughly to a similar degree as using transformers. These results hint that MLPs can effectively learn to align vision and text features extracted from lower-level encoders without heavy reliance on self-attention. Based on this, we ask an even bolder question: can we have an all-MLP architecture for VL modeling, where both VL fusion and the vision encoder are replaced with MLPs? Our result shows that an all-MLP VL model is sub-optimal compared to state-of-the-art full-featured VL models when both of them get pre-trained. However, pre-training an all-MLP can surprisingly achieve a better average score than full-featured transformer models without pre-training. This indicates the potential of large-scale pre-training of MLP-like architectures for VL modeling and inspires the future research direction on simplifying well-established VL modeling with less inductive design bias. Our code is publicly available at: https://github.com/easonnie/mlp-vil.

“…For NLVR2 [52], given a pair of images and a text description, the model judges the correctness of the description based on the visual clues in the image pair. For SNLI-VE [60], the model predicts whether a given image semantically entails a given sentence.…”

Section: Methods (mentioning)

confidence: 99%

MLP Architectures for Vision-and-Language Modeling: An Empirical Study

Nie, Fu, Gan et al. 2021

Preprint

“…To classify the more fine-grained relationship than NLVR between an image and a text pair, VE aims to infer the image-to-text relationship to be true (entailment), false (contradiction) or neutral. For this task, we evaluate our model on SNLI-VE dataset [41] which is constructed based on Stanford Natural Language Inference (SNLI) [6] and Flickr30K [34] datasets. We follow [9,18] to perform the VE task as a three-way classification problem.…”

Section: Downstream Tasks (mentioning)

confidence: 99%

Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

Xue, Huang, Liu et al. 2021

Preprint

Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves for downstream vision-language tasks in a finetuning fashion. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN, and then aligns images and text with a Transformer. Visual relationship between visual contents plays an important role in image understanding and is the basis for inter-modal alignment learning. However, CNNs have limitations in visual relation learning due to local receptive field's weakness in modeling long-range dependencies. Thus the two objectives of learning visual relation and inter-modal alignment are encapsulated in the same Transformer network. Such design might restrict the inter-modal alignment learning in the Transformer by ignoring the specialized characteristic of each objective. To tackle this, we propose a fully Transformer visual embedding for VLP to better learn visual relation and further promote inter-modal alignment. Specifically, we propose a metric named Inter-Modality Flow (IMF) to measure the interaction between vision and language modalities (i.e., inter-modality). We also design a novel masking optimization mechanism named Masked Feature Regression (MFR) in Transformer to further promote the inter-modality learning. To the best of our knowledge, this is the first study to explore the benefit of Transformer for visual feature learning in VLP. We verify our method on a wide range of vision-language tasks, including Image-Text Retrieval, Visual Question Answering (VQA), Visual Entailment and Visual Reasoning. Our approach not only outperforms the state-of-the-art VLP performance, but also shows benefits on the IMF metric. * This work was performed when Hongwei Xue and Yupan Huang were visiting Microsoft Research as research interns.

“…Constructed multimodal classification tasks. In addition to image question answering/reasoning datasets already mentioned in §1, other multimodal tasks have been constructed, e.g., video QA (Zellers et al, 2019), visual entailment (Xie et al, 2018), hateful multimodal meme detection (Kiela et al, 2020), and tasks related to visual dialog (de Vries et al, 2017). In these cases, unimodal baselines are shown to achieve lower performance relative to their expressive multimodal counterparts.…”

Section: Related Work (mentioning)

confidence: 99%

Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!

Hessel, Lee 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Modeling expressive cross-modal interactions seems crucial in multimodal tasks, such as visual question answering. However, sometimes high-performing black-box algorithms turn out to be mostly exploiting unimodal signals in the data. We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task. This function projection modifies model predictions so that cross-modal interactions are eliminated, isolating the additive, unimodal structure. For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation. Surprisingly, this holds even when expressive models, with capacity to consider interactions, otherwise outperform less expressive models; thus, performance improvements, even when present, often cannot be attributed to consideration of cross-modal feature interactions. We hence recommend that researchers in multimodal machine learning report the performance not only of unimodal baselines, but also the EMAP of their best-performing model.
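
The abstract above does not spell out the projection itself. As a rough illustration only, an additive projection of this kind can be sketched as follows, assuming the model has been scored on every text-image combination in an evaluation set. The function name `emap_scores` and the `pair_scores` layout are hypothetical and are not taken from the paper's code.

```python
def emap_scores(pair_scores):
    """Additive projection of a square score table (illustrative sketch).

    pair_scores[i][j] is assumed to hold the model's score for text i
    paired with image j. The result keeps only the part of each matched
    score (i, i) that can be written as text_term(i) + image_term(i).
    """
    n = len(pair_scores)
    # Average over images: what the text contributes on its own.
    text_mean = [sum(row) / n for row in pair_scores]
    # Average over texts: what the image contributes on its own.
    image_mean = [sum(pair_scores[i][j] for i in range(n)) / n for j in range(n)]
    # The overall mean is counted twice above, so subtract it once.
    grand_mean = sum(text_mean) / n
    return [text_mean[i] + image_mean[i] - grand_mean for i in range(n)]


# Scores that depend only on whether text and image match (pure interaction).
interaction_only = [[1.0, 0.0], [0.0, 1.0]]
print(emap_scores(interaction_only))
```

In this toy table the matched scores of 1.0 project to 0.5, because nothing about them can be explained by the text or the image alone. A table built as a sum of a per-text term and a per-image term would pass through unchanged.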
