2022
DOI: 10.48550/arxiv.2201.09862
Preprint

Learning to Act with Affordance-Aware Multimodal Neural SLAM

Abstract: Recent years have witnessed an emerging paradigm shift toward embodied artificial intelligence, in which an agent must learn to solve challenging tasks by interacting with its environment. There are several challenges in solving embodied multimodal tasks, including long-horizon planning, vision-and-language grounding, and efficient exploration. We focus on a critical bottleneck, namely the performance of planning and navigation. To tackle this challenge, we propose a Neural SLAM approach that, for the first ti…

Cited by 2 publications (3 citation statements)
References 18 publications
“…In simulated environments, Logeswaran et al. (2022) propose a language-only finetuned GPT-2 model for task planning on ALFRED. Some end-to-end ALFRED models also have task planning as a component (Min et al., 2021; Jia et al., 2022; Blukis et al., 2022). However, this is a simpler dataset where task planning can be cast as a 7-way classification problem.…”
Section: Related Work
confidence: 99%
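The "7-way classification" remark refers to ALFRED's fixed set of seven high-level task types, so task planning reduces to predicting a single label from the task's language instruction. Below is a minimal sketch of such a classifier; the module name, encoder dimensions, and pooled-embedding input are illustrative assumptions and do not reproduce the cited models.

# Hypothetical sketch (not the cited papers' implementation): cast ALFRED task
# planning as 7-way classification over a pooled language-instruction embedding.
import torch
import torch.nn as nn

NUM_TASK_TYPES = 7  # ALFRED defines seven high-level task types

class TaskTypeClassifier(nn.Module):
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        # Small MLP head over a fixed-size instruction embedding (assumed to
        # come from some pretrained sentence encoder).
        self.head = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, NUM_TASK_TYPES),
        )

    def forward(self, instruction_embedding: torch.Tensor) -> torch.Tensor:
        # instruction_embedding: (batch, embed_dim) -> (batch, 7) task-type logits
        return self.head(instruction_embedding)

# Usage with random features standing in for encoder output:
model = TaskTypeClassifier()
logits = model(torch.randn(4, 768))
predicted_task_type = logits.argmax(dim=-1)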
“…In such a system, the coffee task considered above would likely start by invoking a semantic navigation module to find the mug and a grasping module to pick it up. Some prior work on embodied AI benchmarks suggests that more modular models can outperform monolithic models (Min et al., 2021; Jia et al., 2022; Zheng et al., 2022; Min et al., 2022). However, these do not evaluate and explore the limitations of individual modules.…”
Section: Introduction
confidence: 99%
“…We observe small improvements in success rate of up to 2 points when the language input is marked up with dialog acts, either at the end or at both the start and end of an utterance, but less benefit is observed from speaker information. We believe that stronger improvements will likely be observed when using a more modular approach (e.g., Min et al., 2021), where it is easier to decouple the effects of errors arising from language understanding from those arising from navigation, which is the most difficult component when predicting such low-level actions (Blukis et al., 2022; Jia et al., 2022; Min et al., 2021).…”
Section: Execution From Dialog History
confidence: 99%