Scheduled Policy Optimization for Natural Language Communication with Intelligent Agents

Xiong, Wenhan; Guo, Xiaoxiao; Yu, Mo; Chang, Shiyu; Zhou, Bowen; Wang, William Yang

doi:10.24963/ijcai.2018/626

Cited by 5 publications

(7 citation statements)

References 6 publications

“…Mapping instruction to action has been studied extensively with intermediate symbolic representations (e.g., Chen and Mooney, 2011; Kim and Mooney, 2012; Artzi and Zettlemoyer, 2013; Artzi et al, 2014; Misra et al, 2015, 2016). Recently, there has been growing interest in direct mapping from raw visual observations to actions (Misra et al, 2017; Xiong et al, 2018; Anderson et al, 2018; Fried et al, 2018). We propose a model that enjoys the benefits of such direct mapping, but explicitly decomposes that task to interpretable goal prediction and action generation.…”

Section: Related Work (mentioning)

confidence: 99%

“…Executing instructions in interactive environments requires mapping natural language and observations to actions. Recent approaches propose learning to directly map from inputs to actions, for example given language and either structured observations (Mei et al, 2016; Suhr and Artzi, 2018) or raw visual observations (Misra et al, 2017; Xiong et al, 2018). Rather than using a combination of models, these approaches learn a single model to solve language, perception, and planning challenges.…”

Section: Introduction (mentioning)

confidence: 99%

Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction

Misra, Bennett, Blukis et al. 2018

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We propose to decompose instruction execution to goal prediction and action generation. We design a model that maps raw visual observations to goals using LINGUNET, a language-conditioned image generation network, and then generates the actions required to complete them. Our model is trained from demonstration only without external resources. To evaluate our approach, we introduce two benchmarks for instruction following: LANI, a navigation task; and CHAI, where an agent executes household instructions. Our evaluation demonstrates the advantages of our model decomposition, and illustrates the challenges posed by our new benchmarks.

“…Language grounding refers to interpreting language in a situated context and includes collaborative language grounding toward situated human-robot dialog (Chai et al, 2016), city exploration (Boye et al, 2014), as well as following high-level navigation instructions. Mapping instructions to low level actions has been explored in structured environments by mapping raw visual representations of the world and text onto actions using Reinforcement Learning methods (Misra et al, 2017; Xiong et al, 2018; Huang et al, 2019). This work has recently been extended to controlling autonomous systems and robots through human language instruction in a 3D simulated environment (Ma et al, 2019; Misra et al, 2018; Blukis et al, 2019) and Mixed Reality (Huang et al, 2019) and using imitation learning.…”

Section: Related Work (mentioning)

confidence: 99%

Learning to Read Maps: Understanding Natural Language Instructions from Unseen Maps

Katsakioris, Konstas, Mignotte et al. 2021

Proceedings of Second International Combined Workshop on Spatial Language Understanding and Grounded Communication for Robotics

Robust situated dialog requires the ability to process instructions based on spatial information, which may or may not be available. We propose a model, based on LXMERT, that can extract spatial information from text instructions and attend to landmarks on OpenStreetMap (OSM) referred to in a natural language instruction. Whilst OSM is a valuable resource, as with any open-sourced data, there is noise and variation in the names referred to on the map, as well as variation in natural language instructions, hence the need for data-driven methods over rule-based systems. This paper demonstrates that the gold GPS location can be accurately predicted from the natural language instruction and metadata with 72% accuracy for previously seen maps and 64% for unseen maps.

“…Misra et al [21] formulate navigation as a sequential-decision process and propose to use reward shaping to effectively train the RL agent. In the same environment, Xiong et al [37] propose a scheduled training mechanism which yields more efficient exploration and achieves better results. However, these methods still operate in synthetic environments and consider either simple discrete observation inputs or unrealistic top-down view of the environment.…”

Section: Related Work (mentioning)

confidence: 99%

Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation

Wang, Xiong, Wang et al. 2018

Lecture Notes in Computer Science

Self Cite

Existing research studies on vision and language grounding for robot navigation focus on improving model-free deep reinforcement learning (DRL) models in synthetic environments. However, model-free DRL models do not consider the dynamics in the real-world environments, and they often fail to generalize to new scenes. In this paper, we take a radical approach to bridge the gap between synthetic studies and real-world practices: we propose a novel, planned-ahead hybrid reinforcement learning model that combines model-free and model-based reinforcement learning to solve a real-world vision-language navigation task. Our look-ahead module tightly integrates a look-ahead policy model with an environment model that predicts the next state and the reward. Experimental results suggest that our proposed method significantly outperforms the baselines and achieves the best on the real-world Room-to-Room dataset. Moreover, our scalable method is more generalizable when transferring to unseen environments.
