Abstract-To build robots that engage in fluid face-to-face spoken conversations with people, robots must have ways to connect what they say to what they see. A critical aspect of how language connects to vision is that language encodes points of view. The meaning of my left and your left differs due to an implied shift of visual perspective. The connection of language to vision also relies on object permanence. We can talk about things that are not in view. For a robot to participate in situated spoken dialog, it must have the capacity to imagine shifts of perspective, and it must maintain object permanence. We present a set of representations and procedures that enable a robotic manipulator to maintain a "mental model" of its physical environment by coupling active vision to physical simulation. Within this model, "imagined" views can be generated from arbitrary perspectives, providing the basis for situated language comprehension and production. An initial application of mental imagery to spatial language understanding for an interactive robot is described.

Index Terms-Active vision, grounding, language, mental imagery, mental models, mental simulation, robots.
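The perspective shift behind "my left" versus "your left" can be illustrated as a simple change of reference frame. The following sketch (our illustration, not the paper's implementation; the function names and 2-D simplification are our own assumptions) expresses a world-frame point in a viewer's egocentric frame and classifies it as left or right of that viewer:

```python
import math

def to_viewer_frame(point, viewer_pos, viewer_heading):
    """Express a world-frame 2-D point in a viewer's egocentric frame.

    viewer_heading is the direction the viewer faces, in radians,
    measured in the world frame. In the returned frame, +x is ahead
    of the viewer and +y is the viewer's left.
    """
    dx = point[0] - viewer_pos[0]
    dy = point[1] - viewer_pos[1]
    c, s = math.cos(viewer_heading), math.sin(viewer_heading)
    # Rotate the offset by -heading so the viewer's gaze lies along +x.
    return (c * dx + s * dy, -s * dx + c * dy)

def is_left_of(point, viewer_pos, viewer_heading):
    """True if the point falls on the viewer's left side."""
    _, y = to_viewer_frame(point, viewer_pos, viewer_heading)
    return y > 0

# Two viewers facing each other disagree about "left": an object at
# (1, 1) is to the left of a robot at the origin facing +x, but to
# the right of a person at (2, 0) facing back toward the robot.
print(is_left_of((1, 1), (0, 0), 0.0))      # robot's left
print(is_left_of((1, 1), (2, 0), math.pi))  # person's right
```

The same rotate-and-translate step, generalized to 3-D camera poses, is what lets a mental model render "imagined" views from perspectives other than the robot's own.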
I. SITUATED LANGUAGE USE
IN USING language to convey meaning to listeners, speakers leverage situational context [1], [2]. Context may include many levels of knowledge ranging from the details of shared physical environments to cultural norms. As the degree of shared context decreases between communication partners, the efficiency of language also decreases since the speaker is forced to explicate increasing quantities of information that could otherwise be left unsaid. A sufficient lack of common ground can lead to communication failures.

If machines are to engage in meaningful, fluent, situated spoken dialog, they must be aware of their situational context. As a starting point, we focus our attention on physical context. A machine that is aware of where it is, what it is doing, the presence and activities of other objects and people in its vicinity, and salient aspects of recent history, can use these contextual factors to interpret natural language.

In numerous applications of spoken language technologies such as talking car navigation systems and speech-based control of portable devices, we envision machines that connect word meanings to the machine's immediate environment. For example, if a car navigation system could see landmarks in its vicinity based on computer vision, and anchor descriptive language to this visual perception, then the system would have a basis for generating contextually appropriate directions such as "Take a left turn immediately after the large red building." Consider also an assistive service robot that can lend a helping hand based on spoken requests from a human user. For the robot to properly interpret requests such as "Hand me the red cup and put it to the right of my plate," the robot must connect the meaning of verbs, nouns, adjectives, and spatial language to the robot's perceptual and action systems in a situationa...