Robots are progressively moving into spaces that have primarily been shaped by human agency; they collaborate with human users on a variety of tasks that require them to understand human language in order to behave appropriately in space. To this end, a stubborn challenge that we address in this paper is inferring the syntactic structure of language, which encompasses grounding parts of speech (e.g., nouns, verbs, and prepositions) through visual perception and inducing a Combinatory Categorial Grammar (CCG) in situated human-robot interaction. This could pave the way towards enabling a robot to understand the syntactic relationships between words (i.e., to understand phrases), and consequently the meaning of human instructions during interaction, which is the future scope of the present study.
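As a brief illustration of the kind of syntactic structure at issue, the following minimal Python sketch parses a simple instruction with NLTK's CCG module. The instruction, the lexical categories, and the hand-written lexicon are assumptions made purely for demonstration; in the setting described here such categories would be induced from situated interaction and grounded in perception rather than written by hand.

# Illustrative only: a hand-written toy CCG lexicon parsed with NLTK.
# The words, categories, and instruction below are hypothetical examples,
# not the lexicon or the induction method of this paper.
from nltk.ccg import chart, lexicon

toy_lexicon = lexicon.fromstring('''
    :- S, NP, N
    move  => S/NP      # verb: expects a noun phrase to its right
    the   => NP/N      # determiner: turns a noun into a noun phrase
    red   => N/N       # adjective: modifies a noun
    block => N         # noun
''')

parser = chart.CCGChartParser(toy_lexicon, chart.DefaultRuleSet)

# Print one CCG derivation for a simple instruction.
for parse in parser.parse("move the red block".split()):
    chart.printCCGDerivation(parse)
    break

The printed derivation combines "red" with "block" into a noun, "the" with that noun into a noun phrase, and "move" with the noun phrase into a sentence, which is exactly the word-to-word relational structure a robot would need to recover in order to interpret a phrase as a whole.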