“…And whether a model can generate diverse (Xu et al, 2018; Baheti et al, 2018), coherent (Li et al, 2016b; Tian et al, 2017; Bosselut et al, 2018; Adiwardana et al, 2020), informative (Shao et al, 2017; Lewis et al, 2017; Ghazvininejad et al, 2017; Young et al, 2017; Zhao et al, 2019) and knowledge-fused (Hua et al, 2020; Zhao et al, 2020; He et al, 2020) responses has become a metric for evaluating a dialog generation model. However, most of the research described above was developed on text only, and the development of multimodal dialog generation has been relatively slow owing to the lack of large-scale datasets.…”