“…Variations of VLN include indoor navigation [3,33,66,40], street-level navigation [9,53], vision-and-dialog navigation [59,74,26], VLN in continuous environments [39], and more. Notwithstanding considerable exploration of pretraining strategies [46,27,50,87], data augmentation approaches [20,21,73], agent architectures and loss functions [86,48,49], existing work in this space considers only model-free approaches. Our aim is to unlock model-based approaches to these tasks, using a visual world model to encode prior commonsense knowledge about human environments and thereby relieve the burden on the agent to learn these regularities.…”