“…Recent years have witnessed increasing attention to visually grounded dialogue (Zarrieß et al., 2016; de Vries et al., 2018; Alamri et al., 2019; Narayan-Chen et al., 2019). Despite the impressive progress on benchmark scores and model architectures (Das et al., 2017b; Wu et al., 2018; Kottur et al., 2018; Gan et al., 2019; Shukla et al., 2019; Niu et al., 2019; Zheng et al., 2019; Kang et al., 2019; Murahari et al., 2019; Pang and Wang, 2020), critical problems have also been pointed out in terms of dataset biases (Goyal et al., 2017; Chattopadhyay et al., 2017; Massiceti et al., 2018; Chen et al., 2018; Kottur et al., 2019; Kim et al., 2020; Agarwal et al., 2020), which obscure such contributions. For instance, Cirik et al. (2018) point out that an existing dataset for reference resolution may be largely solvable without recognizing the full referring expressions (e.g.…”