“…In (Antol et al, 2015), Visual7W (Zhu et al, 2016), TGIF-QA (Jang et al, 2017), TV-QA (Lei et al, 2018), IQA (Gordon et al, 2018), EQA (Wijmans et al, 2019)
Image/video grounded dialogues, navigation dialogues: VisDial, GuessWhat (De Vries et al, 2017), AVSD, CVDN (Thomason et al, 2019)
Synthetic image/video QA: SHAPE (Andreas et al, 2016), CLEVR, SVQA (Song et al, 2018), CLEVRER (Yi* et al, 2020)
Synthetic dialogues: bAbI (Bordes et al, 2017), MNIST Dialog (Seo et al, 2017), CLEVR-Dialog (Kottur et al, 2019)…”