2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021
DOI: 10.1109/iccv48922.2021.00162
Self-Motivated Communication Agent for Real-World Vision-Dialog Navigation

Cited by 21 publications (12 citation statements). References 20 publications.
“…An intelligent agent asks for help when uncertain about the next action (Nguyen et al., 2021b). Action probabilities or a separately trained model (Chi et al., 2020; Zhu et al., 2021e; Nguyen et al., 2021a) can be leveraged to decide whether to ask for help. Using natural language to converse with the oracle covers a wider problem scope than sending a signal.…”
Section: Asking For Help
confidence: 99%
“…Almost all instruction-following dialogue tasks need to consider contextual information and actions as well as the state of the world (Suhr and Artzi, 2018; Lachmy et al., 2021), which remains a key challenge for this family of tasks. In particular, the Vision-and-Dialog Navigation (VDN) task (Roman et al., 2020; Zhu et al., 2021), where question-answering dialogue and visual contexts are leveraged to facilitate navigation, has attracted increasing research attention. Other tasks, such as moving-blocks tasks (Misra et al., 2017) and object-finding tasks (Janner et al., 2018), also require modelling both contextual information in natural language and the world-state representation.…”
Section: Related Work and Background
confidence: 99%
“…The problem of instruction following for navigation has drawn significant attention in a wide range of domains. These include Google Street View panoramas [11], simulated environments for quadcopters [5], multilingual settings [33], interactive vision-dialogue setups [60], real-world scenes [3], and realistic simulations of indoor scenes [4]. More relevant to our work is the literature on the Vision-and-Language Navigation (VLN) task, initially defined in [4] on navigation graphs (R2R) in the Matterport3D [8] dataset, and later converted to continuous environments in [32] (VLN-CE).…”
Section: Related Work
confidence: 99%