“…Previous language and vision studies focused on the development of multimodal word and sentence representations (Bruni et al., 2012; Socher et al., 2013; Silberer and Lapata, 2014; Gong et al., 2014; Lazaridou et al., 2015), as well as methods for describing images and videos in natural language (Farhadi et al., 2010; Kulkarni et al., 2011; Mitchell et al., 2012; Socher et al., 2014; Thomason et al., 2014; Karpathy and Fei-Fei, 2014; Siddharth et al., 2014; Venugopalan et al., 2015; Vinyals et al., 2015). While these studies handle important challenges in multimodal processing of language and vision, they do not provide explicit modeling of linguistic ambiguities.…”