There is a long tradition of grounding language understanding in single images, in the form of visual question answering (Goyal et al., 2017; Hudson and Manning, 2019), visual dialogue (de Vries et al., 2017; Das et al., 2017), or visual entailment (Xie et al., 2019). Recently, increasing attention has been directed to settings where the visual context consists of multiple images, either conventional static pictures (Vedantam et al., 2017; Hu et al., 2019; Suhr et al., 2019; Forbes et al., 2019; Hendricks and Nematzadeh, 2021; Yan et al., 2021; Hosseinzadeh and Wang, 2021; Bogin et al., 2021; Liu et al., 2021) or video frames (Jhamtani and Berg-Kirkpatrick, 2018a; Bansal et al., 2020). While many of these benchmarks involve just two images, COVR (Bogin et al., 2021) and ISVQA (Bansal et al., 2020) provide larger image sets, similar to our sets of 10 images.