The language of space and spatial relations is a rich source of abstract semantic structure. We develop a probabilistic model that learns to understand utterances describing spatial configurations of objects in a tabletop scene by seeking the meaning that best explains the speaker's choice of sentence. The inference problem is simplified by assuming that sentences express symbolic representations of (latent) semantic relations between referents and landmarks in space, and that given these symbolic representations, utterances and physical locations are conditionally independent. As such, the inference problem factors into a symbol-grounding component (linking propositions to physical locations) and a symbol-translation component (linking propositions to parse trees). We evaluate the model by eliciting production and comprehension data from human English speakers and find that our system recovers the referent of spatial utterances at a level of proficiency approaching human performance.
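As a sketch of the factorization described above, in illustrative notation that is not the paper's own: let $u$ denote an utterance, $x$ a physical location, and $r$ a latent symbolic relation between a referent and a landmark. The conditional-independence assumption can then be written as
$$P(u, x \mid r) = P(u \mid r)\, P(x \mid r),$$
so that recovering the intended location given an utterance factors into a symbol-grounding term $P(x \mid r)$ and a symbol-translation term $P(u \mid r)$:
$$P(x \mid u) \propto \sum_{r} P(x \mid r)\, P(u \mid r)\, P(r).$$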
I. INTRODUCTION

Imagine that a friend asks you to "Bring me the thing toward the far corner of the table." This simple request requires fairly sophisticated cognitive processing. You must first identify that she is referring to something on a table, in particular, one with corners. Then, you must orient the table to distinguish "far" from "near" corners. Finally, if there is more than one object near a "far" corner, but one is also very near the edge, you might favor the other one, reasoning that it would have been easy to ask for "the one near the edge".

We model a version of this problem, with a particular focus on recovering the referent of spatial utterances like "the thing toward the far corner of the table", where the only information available about the intended object is its position relative to a landmark in the scene. Beginning with no knowledge of the meanings of words, but equipped with a small vocabulary of spatial relations (e.g., containment, proximity, ordering in cardinal directions) and abstract representations of objects and their parts (e.g., a table can be represented as a line with ends and a middle, or as a rectangle with corners, quadrants, and edges), our model learns probabilistic correspondences between sentences and abstract spatial relations between referents and landmarks by "observing" a teacher repeatedly generating an utterance and pointing to a location in space.

A method that relied on supervised learning from observed propositional semantics would not be developmentally plausible, since children do not get to observe symbolic meaning directly; crucially, then, the abstract relations are never made overt to our learner. Rather, the locations are probabilistically assigned to abstract landmark-relation pairs