DialFRED: Dialogue-Enabled Agents for Embodied Instruction Following

Gao, Xin; Gao, Qiaozi; Gong, Ran; Lin, Kwei-Jay; Thattai, Govind; Sukhatme, Gaurav S.

doi:10.1109/lra.2022.3193254

Cited by 21 publications

(29 citation statements)

References 47 publications

“…In Lynch et al (2022) the authors present a RL and imitation based framework, Interactive Language, that is capable of continuously adjusting its behavior to natural language based instructions in a real-time interactive setting. There have also been a recent wave of datasets and benchmarks created by utilizing 3D household simulators and crowd sourcing tools to collect large-scale task-oriented dialogue aimed at improving the interactive language capabilities of embodied task-oriented agents Padmakumar et al (2022), Gao et al (2022), Team et al (2021). Most of the above mentioned works focus on the verbal mode of communication and largely on the comprehension side (e.g., instruction following).…”

Section: RL For Communication In Task-oriented Embodied Agents (mentioning)

confidence: 99%

Learning to generate pointing gestures in situated embodied conversational agents

et al. 2023

One of the main goals of robotics and intelligent agent research is to enable them to communicate with humans in physically situated settings. Human communication consists of both verbal and non-verbal modes. Recent studies in enabling communication for intelligent agents have focused on verbal modes, i.e., language and speech. However, in a situated setting the non-verbal mode is crucial for an agent to adapt flexible communication strategies. In this work, we focus on learning to generate non-verbal communicative expressions in situated embodied interactive agents. Specifically, we show that an agent can learn pointing gestures in a physically simulated environment through a combination of imitation and reinforcement learning that achieves high motion naturalness and high referential accuracy. We compared our proposed system against several baselines in both subjective and objective evaluations. The subjective evaluation is done in a virtual reality setting where an embodied referential game is played between the user and the agent in a shared 3D space, a setup that fully assesses the communicative capabilities of the generated gestures. The evaluations show that our model achieves a higher level of referential accuracy and motion naturalness compared to a state-of-the-art supervised learning motion synthesis model, showing the promise of our proposed system that combines imitation and reinforcement learning for generating communicative gestures. Additionally, our system is robust in a physically-simulated environment thus has the potential of being applied to robots.

“…There have also been a recent wave of datasets and benchmarks created by utilizing 3D household simulators and crowd sourcing tools to collect large-scale task-oriented dialogue aimed at improving the interactive language capabilities of embodied task-oriented agents Padmakumar et al. (2022) , Gao et al. (2022) , Team et al.…”

Section: Introduction (mentioning)

confidence: 99%

Learning to generate pointing gestures in situated embodied conversational agents

et al. 2023

“…Similarly, collects a large dataset of CRs to user requests, augmented synthetically, in a multiple-step process without interaction. Another large-scale dataset with 53k task-relevant questions and answers about an instruction was constructed Gao et al (2022). However, the data is created by an annotator that does not have to act, but only watches execution videos, asking a question they think would be helpful and then answering their own question.…”

Section: Related Literature (mentioning)

confidence: 99%

Instruction Clarification Requests in Multimodal Collaborative Dialogue Games: Tasks, and an Analysis of the CoDraw Dataset

Brielen, Schlangen

2023

Preprint

In visual instruction-following dialogue games, players can engage in repair mechanisms in face of an ambiguous or underspecified instruction that cannot be fully mapped to actions in the world. In this work, we annotate Instruction Clarification Requests (iCRs) in CoDraw, an existing dataset of interactions in a multimodal collaborative dialogue game. We show that it contains lexically and semantically diverse iCRs being produced self-motivatedly by players deciding to clarify in order to solve the task successfully. With 8.8k iCRs found in 9.9k dialogues, CoDraw-iCR (v1) is a large spontaneous iCR corpus, making it a valuable resource for data-driven research on clarification in dialogue. We then formalise and provide baseline models for two tasks: Determining when to make an iCR and how to recognise them, in order to investigate to what extent these tasks are learnable from data.

1 T: above the tree is a cloud with lightning
2 D: small size ?
3 T: it fits right above the tree so the whole cloud is seen and the bolt is just above the top of the tree
4 D: got it and
5 T: to the left of the cloud is a air balloon with a very tip of the top off screen
6 D: is it large or small in size ?
7 T: maybe medium
8 D: done what else
9 T: to the left of the balloon is another regular cloud about one inch from the left
10 D: okay and
11 T: just left of center in the green is a medium girl facing right
12 D: expression of the girl ? what side is she facing ?
13 T: she is standing with a sad face and one hand facing out . she is facing the tree
14 D: got it and

“…However, the reasoning mainly focuses on the outcome or the history of the navigation on 2D images and does not require a holistic 3D understanding of the environment. There are also works [12,20,51,54,57,69] targeting instruction following in embodied environments, in which an agent is asked to perform a series of tasks based on language instructions. Different from their settings, for our benchmark an embodied agent actively explores the environment and takes multi-view images for 3D-related reasoning.…”

Section: Related Work (mentioning)

confidence: 99%