“…Progress on language in this area has largely focused on grounding visual attributes (Kollar et al., 2013; Matuszek et al., 2014) and on learning spatial relations and actions for small vocabularies with hard-coded abstract concepts (Steels and Vogt, 1997; Roy, 2002; Guadarrama et al., 2013). Language is sometimes grounded into simple actions (MacMahon et al., 2006; Yu and Siskind, 2013), but the data, while multimodal, is relatively formulaic, the vocabularies are small, and the grammar is constrained.…”