“…A recent flurry of work has focused on integrating vision and language, leading to creative combinations of computer vision and NLP models. Active research areas include image-caption generation (Chen et al., 2015; Vinyals et al., 2014; Xu et al., 2015), visual question answering (Agrawal et al., 2017; Das et al., 2018; Johnson, Hariharan, van der Maaten, Fei-Fei, et al., 2017), visual question asking (Mostafazadeh et al., 2016; Rothe et al., 2017; Wang & Lake, 2021), zero-shot visual category learning (Lazaridou et al., 2015; Xian et al., 2017), and instruction following (Hill, Lampinen, et al., 2020; Ruis et al., 2020). The multimodal nature of these tasks grounds the word representations acquired by these models, as we discuss below.…”