In this paper, we address multimodal language understanding with unconstrained fetching instructions for domestic service robots. A typical fetching instruction such as "Bring me the yellow toy from the white shelf" requires inferring the user's intention, i.e., which object to fetch (the target) and from where (the source). To solve this task, we propose the Multimodal Target-source Classifier Model (MTCM), which predicts the region-wise likelihood of target and source candidates in the scene. Unlike existing methods, MTCM performs region-wise classification based on both linguistic and visual features. In our evaluation, the proposed approach outperformed the state-of-the-art method on a standard dataset. We also extended MTCM with Generative Adversarial Nets (MTCM-GAN), enabling simultaneous data augmentation and classification.
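To make the region-wise classification idea concrete, here is a minimal sketch of scoring candidate regions from a fused linguistic and visual representation. This is an illustrative assumption of the general pattern, not the published MTCM architecture: the class name RegionScorer, the feature dimensions, and the layer sizes are all made up for the example.

```python
# Minimal sketch: score each candidate region as target/source by fusing
# an instruction embedding with per-region visual features. Illustrative
# only -- not the published MTCM architecture; all dimensions are assumed.
import torch
import torch.nn as nn

class RegionScorer(nn.Module):
    def __init__(self, text_dim=300, visual_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + visual_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # one logit each for the target and source roles
        )

    def forward(self, text_feat, region_feats):
        # text_feat: (text_dim,) instruction embedding
        # region_feats: (num_regions, visual_dim), one row per candidate region
        n = region_feats.size(0)
        fused = torch.cat([text_feat.expand(n, -1), region_feats], dim=1)
        logits = self.mlp(fused)            # (num_regions, 2)
        # Normalize over regions: column 0 is the target distribution,
        # column 1 the source distribution.
        return logits.softmax(dim=0)

scorer = RegionScorer()
probs = scorer(torch.randn(300), torch.randn(5, 512))
print(probs.shape)  # torch.Size([5, 2])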
This paper focuses on a multimodal language understanding method for carry-and-place tasks with domestic service robots. We address the case of ambiguous instructions, that is, when the target area is not specified. For instance, "put away the milk and cereal" is a natural instruction that leaves the target area ambiguous in typical daily-life environments. Conventionally, such an instruction would be disambiguated through a dialogue system, but at the cost of additional time and cumbersome interaction. Instead, we propose a multimodal approach in which instructions are disambiguated using the robot's state and the environment context. We develop the Multi-Modal Classifier Generative Adversarial Network (MMC-GAN), which predicts the likelihood of different target areas while accounting for the robot's physical limitations and the clutter in the target area. Our approach, MMC-GAN, significantly improves accuracy compared with baseline methods that use instructions only or simple deep neural networks.
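The combination of a GAN with a classifier that both abstracts rely on can be sketched, under assumptions, as a discriminator with an auxiliary classification head (in the spirit of auxiliary-classifier or semi-supervised GANs). The sketch below is not the published MMC-GAN; the name AreaDiscriminator and all sizes are hypothetical.

```python
# Hedged sketch: a discriminator with two heads, one adversarial
# (real vs. generated feature) and one classifying candidate target
# areas. Illustrative of the general GAN-plus-classifier pattern only;
# the published MMC-GAN differs in its details.
import torch
import torch.nn as nn

class AreaDiscriminator(nn.Module):
    def __init__(self, feat_dim=256, num_areas=4):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.adv_head = nn.Linear(128, 1)          # real vs. generated feature
        self.cls_head = nn.Linear(128, num_areas)  # likelihood of each target area

    def forward(self, x):
        h = self.trunk(x)
        return self.adv_head(h), self.cls_head(h).softmax(dim=-1)

# A generator producing synthetic multimodal features would be trained
# adversarially against adv_head, while cls_head is trained on labeled
# features -- yielding data augmentation and classification in one model.
```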
This paper presents a novel approach to robot audition that performs robotic tasks using auditory cues. Unlike many previous works, we propose a control scheme that does not require any explicit sound source localization. The approach is capable of controlling all three degrees of freedom in the plane from only two microphones. Built upon the sensor-based control framework, it relies on the implicit sound source direction obtained from the time difference of arrival (TDOA). We introduce an analytical model of the auditory cues for a robot equipped with a pair of microphones and multiple sound sources, from which a control scheme is designed. A stability analysis is also provided. The results obtained in simulation show the feasibility and the suitability of this method, even in reverberant environments.
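As a small worked illustration of the TDOA cue this abstract builds on, the sketch below estimates the inter-microphone delay by cross-correlation and relates it to source direction through the standard far-field relation tau = d*sin(theta)/c. The function name, the simulation setup, and the sign/angle conventions are assumptions for the example, not the paper's control scheme.

```python
# Estimate the time difference of arrival (TDOA) between two microphone
# signals via cross-correlation, then recover an implicit direction cue
# from the far-field relation tau = d * sin(theta) / c. Illustrative
# sketch only; the paper's actual modelling and control law differ.
import numpy as np

def estimate_tdoa(x1, x2, fs):
    """Delay (in seconds) of x2 relative to x1; positive if x2 arrives later."""
    corr = np.correlate(x2, x1, mode="full")
    lag = np.argmax(corr) - (len(x1) - 1)   # lag in samples
    return lag / fs

# Assumed geometry: mic spacing d [m], speed of sound c [m/s], rate fs [Hz].
d, c, fs = 0.2, 343.0, 16000
theta = np.deg2rad(30.0)                    # assumed true source direction
tau_true = d * np.sin(theta) / c

# Synthetic check: white noise at mic 1, delayed copy at mic 2.
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
x1 = s
x2 = np.roll(s, int(round(tau_true * fs)))  # integer-sample approximation

tau_hat = estimate_tdoa(x1, x2, fs)
theta_hat = np.degrees(np.arcsin(np.clip(c * tau_hat / d, -1.0, 1.0)))
print(theta_hat)  # close to 30 deg, up to integer-sample quantization
```

Note that a sensor-based controller of the kind the abstract describes can regulate such a cue directly (e.g., drive tau_hat toward a reference value) without ever reconstructing the source position explicitly.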