Some recent work has also investigated multimodal NLI, in which the entailment relation is classified on the basis of image features (Xie et al., 2019; Lai, 2018), or a combination of image and textual features (Vu et al., 2018). In particular, Vu et al. (2018) exploited the fact that the main portion of SNLI was created by reusing image captions from the Flickr30k dataset (Young et al., 2014) as premises, for which entailment, contradiction and neutral hypotheses were subsequently crowdsourced via Amazon Mechanical Turk (Bowman et al., 2015). This makes it possible to pair each premise with the image for which it was originally written as a descriptive caption, thereby reformulating the NLI problem as a Vision-Language task.
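To make the pairing concrete, the sketch below shows one way to link SNLI examples back to their Flickr30k source images via the captionID field in the SNLI distribution, which has the form `<image id>.jpg#<caption index>`. The file paths and the helper function name are illustrative assumptions, not part of either dataset's release.

```python
import json
import os

def pair_snli_with_flickr30k(snli_jsonl_path, flickr30k_image_dir):
    """Pair SNLI premises with the Flickr30k images they were
    originally written for, using SNLI's captionID field.

    captionID has the form '<image id>.jpg#<caption index>', so the
    image filename is recovered by dropping the '#<index>' suffix.
    Paths and directory layout are assumptions for illustration.
    """
    pairs = []
    with open(snli_jsonl_path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            # SNLI marks examples with no annotator consensus with '-'.
            if example["gold_label"] == "-":
                continue
            image_file = example["captionID"].split("#")[0]
            pairs.append({
                "image_path": os.path.join(flickr30k_image_dir, image_file),
                "premise": example["sentence1"],
                "hypothesis": example["sentence2"],
                "label": example["gold_label"],
            })
    return pairs

# Illustrative usage; the file locations are assumptions.
# examples = pair_snli_with_flickr30k("snli_1.0_train.jsonl", "flickr30k-images/")
```

Each resulting record bundles an image with a premise-hypothesis pair, which is the grounded-NLI formulation described above.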