“…Recently, there has been increasing interest in vision-language tasks, such as image captioning (Anderson et al, 2016, 2018; Cornia et al, 2020) and visual question answering (Ren et al, 2015a; Gao et al, 2015; Lu et al, 2016; Anderson et al, 2018). In the real world, our conversations (Chen et al, 2020b, 2019) usually have multiple turns. As an extension of conventional single-turn visual question answering, Das et al (2017) introduce a multi-turn visual question answering task named visual dialogue, which aims to…”