“…The release of high-quality 3D building and street captures (Chang et al., 2017; Mirowski et al., 2019; Mehta et al., 2020; Xia et al., 2018; Straub et al., 2019) has galvanized interest in developing embodied navigation agents that can operate in complex human environments. Based on these environments, annotations have been collected for a variety of tasks, including navigating to a particular class of object (ObjectNav) (Batra et al., 2020), navigating from language instructions, also known as vision-and-language navigation (VLN) (Anderson et al., 2018b; Qi et al., 2020; Ku et al., 2020), and vision-and-dialog navigation (Thomason et al., 2020; Hahn et al., 2020). To date, most of these data collection efforts have required the development of custom annotation tools.…”