Image captioning (Anderson et al., 2018b; Vinyals et al., 2015; Xu et al., 2015), visual question answering (Antol et al., 2015; Goyal et al., 2017), and visual dialog (Das et al., 2017a,b) are examples of active research areas in this field. At the same time, visual navigation (Gupta et al., 2017; Shen et al., 2019; Xia et al., 2018) and goal-oriented instruction following (Chen et al., 2019; Fu et al., 2019; Qi et al., 2020b) represent an important part of current work on embodied AI (Das et al., 2018a,b; Savva et al., 2019; Yang et al., 2019). In this context, Vision-and-Language Navigation (VLN) (Anderson et al., 2018c) constitutes a distinctive challenge, as it enriches traditional navigation with visually rich environments and detailed natural-language instructions.