2018
DOI: 10.48550/arxiv.1811.10582
Preprint

Visual Entailment Task for Visually-Grounded Language Learning

Abstract: We introduce a new inference task - Visual Entailment (VE) - which differs from traditional Textual Entailment (TE) tasks in that the premise is defined by an image, rather than a natural language sentence as in TE tasks. A novel dataset, SNLI-VE (publicly available at https://github.com/necla-ml/SNLI-VE), is proposed for VE tasks based on the Stanford Natural Language Inference corpus and Flickr30k. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem…
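
The abstract describes SNLI-VE as SNLI hypotheses and labels paired with the Flickr30k images that served as premises. A minimal sketch of reading such data is below; the file name, JSONL format, and field names (`Flickr30K_ID`, `sentence2`, `gold_label`) are assumptions based on the repository description, not a verified schema.

```python
import json
from collections import Counter

def load_snli_ve(path):
    """Load SNLI-VE-style examples: image premise, text hypothesis, 3-way label.

    Field names are assumed; consult the SNLI-VE repository for the exact schema.
    """
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            examples.append({
                "image_id": record.get("Flickr30K_ID"),   # premise = Flickr30k image
                "hypothesis": record.get("sentence2"),     # hypothesis = SNLI sentence
                "label": record.get("gold_label"),         # entailment / neutral / contradiction
            })
    return examples

if __name__ == "__main__":
    data = load_snli_ve("snli_ve_train.jsonl")             # hypothetical file name
    print(Counter(ex["label"] for ex in data))
```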

Cited by 13 publications (16 citation statements) · References: 18 publications

“…Downstream Tasks. We conduct a comprehensive evaluation of our models over a wide range of downstream tasks, including VQAv2 [11], GQA [20], Visual Entailment (SNLI-VE) [60], NLVR2 [52], and Image-Text Retrieval.…”
Section: Methods (mentioning, confidence: 99%)
“…For NLVR2 [52], given a pair of images and a text description, the model judges the correctness of the description based on the visual clues in the image pair. For SNLI-VE [60], the model predicts whether a given image semantically entails a given sentence.…”
Section: Methods (mentioning, confidence: 99%)
“…To classify the more fine-grained relationship than NLVR between an image and a text pair, VE aims to infer the image-to-text relationship to be true (entailment), false (contradiction) or neutral. For this task, we evaluate our model on SNLI-VE dataset [41] which is constructed based on Stanford Natural Language Inference (SNLI) [6] and Flickr30K [34] datasets. We follow [9,18] to perform the VE task as a three-way classification problem.…”
Section: Downstream Tasks (mentioning, confidence: 99%)
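
The statement above casts VE as a three-way classification over an image-text pair. A minimal sketch of such a classifier head follows, assuming pooled image and hypothesis features are produced by separate encoders elsewhere; the layer sizes and fusion scheme are illustrative only and are not taken from EVE or from any of the cited models.

```python
import torch
import torch.nn as nn

class VEClassifier(nn.Module):
    """Three-way VE head: entailment / neutral / contradiction from fused features."""

    def __init__(self, img_dim=2048, txt_dim=768, hidden=512, num_classes=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, img_feat, txt_feat):
        # img_feat: (batch, img_dim) pooled image premise features
        # txt_feat: (batch, txt_dim) pooled hypothesis features
        return self.fuse(torch.cat([img_feat, txt_feat], dim=-1))

# Usage with random features standing in for real image/text encoders.
model = VEClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 1]))
```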
“…Constructed multimodal classification tasks. In addition to image question answering/reasoning datasets already mentioned in §1, other multimodal tasks have been constructed, e.g., video QA (Zellers et al., 2019), visual entailment (Xie et al., 2018), hateful multimodal meme detection (Kiela et al., 2020), and tasks related to visual dialog (de Vries et al., 2017). In these cases, unimodal baselines are shown to achieve lower performance relative to their expressive multimodal counterparts.…”
Section: Related Work (mentioning, confidence: 99%)