“…The release of high-quality 3D building and street captures (Chang et al., 2017; Mirowski et al., 2019; Mehta et al., 2020; Xia et al., 2018; Straub et al., 2019) has galvanized interest in developing embodied navigation agents that can operate in complex human environments. Based on these environments, annotations have been collected for a variety of tasks, including navigating to a particular class of object (ObjectNav) (Batra et al., 2020), navigating from language instructions, also known as vision-and-language navigation (VLN) (Anderson et al., 2018b; Qi et al., 2020; Ku et al., 2020), and vision-and-dialog navigation (Thomason et al., 2020; Hahn et al., 2020). To date, most of these data collection efforts have required the development of custom annotation tools.…”