“…Some work presents a Bayesian probabilistic formulation to learn referential grounding in dialog (Liu et al, 2014), user preferences (Cadilhac et al, 2013), and color descriptions (McMahan and Stone, 2015; Andreas and Klein, 2014). A large body of work also focuses on leveraging attention mechanisms for grounding multimodal phenomena in images (Srinivasan et al, 2020; Chu et al, 2018; Fan et al, 2019; Vu et al, 2018; Kawakami et al, 2019; Dong et al, 2019), videos (Lei et al, 2020), and navigation of embodied agents (Yang et al, 2020), among others. Other work approaches grounding with data structures such as graphs, in the domains of images (Chang et al, 2015; Liu et al, 2014), videos, text (Laws et al, 2010; Chen, 2012; Massé et al, 2008), entities (Zhou et al, 2018a), knowledge graphs and ontologies (Jauhar et al, 2015; Zhang et al, 2020), and interactive settings (Jauhar et al, 2015; Xu et al, 2020).…”