2021 IEEE International Conference on Robotics and Automation (ICRA)
DOI: 10.1109/icra48506.2021.9560895
Spatial Reasoning from Natural Language Instructions for Robot Manipulation

Cited by 20 publications (10 citation statements)
References 21 publications
“…In this paper, we focus on spatial reasoning over text, which can be described as inferring the implicit spatial relations from the explicit relations described in the text. Spatial reasoning plays a crucial role in diverse domains, including language grounding (Liu et al., 2022), navigation (Zhang et al., 2021), and human-robot interaction (Venkatesh et al., 2021). By studying this task, we can analyze both the reading comprehension and logical reasoning capabilities of models.…”
Section: Introduction
confidence: 99%
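The distinction drawn above, between explicit relations stated in the text and implicit relations that must be inferred, can be illustrated with a minimal sketch. This is a hypothetical example, not code from the cited paper: it treats relations as triples and derives implicit facts by transitive closure over a relation such as `left_of`.

```python
def spatial_closure(facts):
    """Given a set of (entity, relation, entity) triples stated explicitly,
    return the closure under transitivity, i.e. the explicit facts plus
    the implicit ones they entail (assuming the relation is transitive)."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(inferred):
            for (c, r2, d) in list(inferred):
                # Chain a -r-> b and b -r-> d into the implicit fact a -r-> d.
                if r1 == r2 and b == c and a != d and (a, r1, d) not in inferred:
                    inferred.add((a, r1, d))
                    changed = True
    return inferred

# Explicit relations extracted from text such as
# "the mug is left of the plate; the plate is left of the bowl".
explicit = {("mug", "left_of", "plate"), ("plate", "left_of", "bowl")}
implicit = spatial_closure(explicit) - explicit
# implicit == {("mug", "left_of", "bowl")}
```

A system evaluated on this task must perform both steps: reading comprehension to extract the explicit triples from language, and logical reasoning to derive the implicit ones.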
“…It is common for referring expressions to contain relational concepts between multiple entities in the scene, and exploiting them has been shown to improve the capability of models to comprehend those expressions (Zender et al., 2009; Nagaraja et al., 2016; Hu et al., 2017; Shridhar et al., 2020). In particular, these relationships tend to be spatial relations from the point of reference of the user, and the robot must be able to cope with this kind of description in order to resolve any ambiguities and eventually identify the right entity in the scene (Ding et al., 2021; Venkatesh et al., 2021; Roh et al., 2022). Ding et al. (2021) present a transformer-based architecture combining the language features with a vision-guided attention framework to model the global context in a multi-modal fashion.…”
Section: Spatial Referring Expressions
confidence: 99%
“…Here, an application's row embodies the principal modality, whereas the column encapsulates the ancillary modality.

Vision:
- Gesture interpretation for visual navigation in VR environment [78]
- Spatial reasoning for robot pickup manipulation in HRC [169]
- Human activity recognition for safe HRC [27]
- Hand gesture recognition for robotic control and navigation and HMI [137]
- Human activity recognition for HRI [95]
- Vision-and-voice navigation for autonomous agent interaction with human and environment [168]
- Contactless force feedback and gesture tracking for enhancing the accuracy and efficiency of human-robot manipulation tasks [45]
- Human position estimation for safe HRC [94]
- 3D object detection in autonomous driving [194]
- MR bidirectional communication for HRI [82]
- Visuo-haptic guidance for mobile collaborative robotic assistant MOCA [173]
- Human activity recognition for HRC in the noisy environment [151]
- Audio-visual scene aware dialog for human-machine conversation [165]
- Human activity recognition for pHRI and HRC [150]
- Gesture recognition for HRI and social robot [130]
- Predicting interactions between objects and environment by tactile and visual feedback for intelligent robotics [171]
- Emotion recognition for HCI [92]
- Bi-directional navigation intent communication for safe HRI [83]
- Visual-inertial hand motion tracking for HRI and VR & AR application [72]

Auditory and language…”
Section: Combination Of Two Types Of Modalities
confidence: 99%
“…g,h) LANG-UNet algorithm for spatial reasoning based on the fusion of text and vision modalities: (g) overall illustration of the spatial reasoning task; (h) the LANG-UNet model architecture. Reproduced with permission [169]. Copyright 2021, IEEE.…”
confidence: 99%