“…From a theoretical perspective, various dialogue structures have been studied, including discourse structure (Stent, 2000; Asher et al., 2003), speech acts (Austin, 1962; Searle, 1969) and common grounding (Clark, 1996; Lascarides and Asher, 2009). In dialogue system engineering, various linguistic structures have been considered and applied, including syntactic dependency (Davidson et al., 2019), predicate-argument structure (PAS) (Yoshino et al., 2011), ellipsis (Quan et al., 2019; Hansen and Søgaard, 2020), intent recognition (Silva et al., 2011; Shi et al., 2016), semantic representation/parsing (Mesnil et al., 2013; Gupta et al., 2018) and frame-based dialogue state tracking (Williams et al., 2016; El Asri et al., 2017). However, most prior work focuses on dialogues where information is not grounded in an external perceptual modality such as vision.…”