The paper presents a neurorobotics cognitive model explaining the understanding and generalisation of nouns and verbs combinations when a vocal command consisting of a verb-noun sentence is provided to a humanoid robot. The dataset used for training was obtained from object manipulation tasks with a humanoid robot platform; it includes 9 motor actions and 9 objects placing placed in 6 different locations), which enables the robot to learn to handle real-world objects and actions. Based on the multiple time-scale recurrent neural networks, this study demonstrates its generalisation capability using a large data-set, with which the robot was able to generalise semantic representation of novel combinations of noun-verb sentences, and therefore produce the corresponding motor behaviours. This generalisation process is done via the grounding process: different objects are being interacted, and associated, with different motor behaviours, following a learning approach inspired by developmental language acquisition in infants. Further analyses of the learned network dynamics and representations also demonstrate how the generalisation is possible via the exploitation of this functional hierarchical recurrent network.