Recurrent neural networks have recently shown significant potential in different language applications, ranging from natural language processing to language modelling. This paper introduces a research effort to use such networks to develop and evaluate natural language acquisition on a humanoid robot. Here, the problem is twofold. First, the focus will be put on using the gesture-word combination stage observed in infants to transition from single to multi-word utterances. Secondly, research will be carried out in the domain of connecting action learning with language learning. In the former, the long-short term memory architecture will be implemented, whilst in the latter multiple time-scale recurrent neural networks will be used. This will allow for comparison between the two architectures, whilst highlighting the strengths and shortcomings of both with respect to the language learning problem. Here, the main research efforts, challenges and expected outcomes are described.