Visio-linguistic stress testing. Existing multimodal stress tests target a range of phenomena: correctly understanding implausible scenes [13], exploitation of language and vision priors [11,27], single-word mismatches between captions and images [64], hate speech detection [26,32,41,92], memes [39,75], ablating one modality to probe the other [22], distracting models with visually similar images [7,33] or with many textually similar suitable captions [1,17], collecting image-caption pairs more diverse than the predominantly English and North American/Western European datasets [50], understanding of verb-argument relationships [30], counting [53], and specific model failure modes [65,69]. Many of these stress tests rely solely on synthetically generated images, often with minimal visual differences but no correspondingly minimal textual changes [80].
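To make the contrast concrete, the sketch below shows one way a minimal-pair textual probe can be scored with an off-the-shelf CLIP-style model: two captions differing only in word order are ranked against a single image, and the model passes the item only if the caption that actually describes the image scores higher. The checkpoint, captions, and placeholder image are illustrative assumptions, not drawn from any of the cited benchmarks.

```python
# A hypothetical minimal-pair probe; a sketch, not any cited benchmark's code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Two captions that differ by a single word swap; a real benchmark would pair
# each item with an image matching exactly one of the two captions.
captions = ["a dog chasing a cat", "a cat chasing a dog"]
image = Image.new("RGB", (224, 224))  # placeholder; use the dataset image here

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # logits_per_image has shape (1, 2): image-to-caption similarity scores.
    probs = model(**inputs).logits_per_image.softmax(dim=-1)

# The model "passes" the item only if the caption that actually describes the
# image receives the higher score; chance performance is 50%.
print({caption: float(p) for caption, p in zip(captions, probs[0])})
```

Extending this setup to pairs of images as well as pairs of captions yields the stricter setting the paragraph above alludes to, in which minimal visual and minimal textual changes must be resolved jointly.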