In natural language processing, previous studies have generally analyzed sound signals and generated responses from them. However, in many conversation scenarios, image information is also vital: without it, misunderstandings may occur and lead to wrong responses. To address this problem, this study proposes a multi-sensor, context-aware chatbot technology based on recurrent neural networks (RNNs). The proposed chatbot model combines image information with sound signals to give appropriate responses to the user. To improve the performance of the proposed model, the long short-term memory (LSTM) structure is replaced by a gated recurrent unit (GRU), and a VGG16 model is chosen as the feature extractor for the image information. The experimental results demonstrate that integrating the sound and image information obtained by the sound sensor and image sensor of a companion robot is helpful for the proposed chatbot model, and they confirm the feasibility of the proposed technology.
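As an illustration of the idea of feeding image features into a GRU alongside the utterance, the following is a minimal sketch and not the implementation used in this study. It assumes that fusion is done by concatenating a fixed image feature vector (a stand-in for VGG16 output) to each token embedding; the abstract does not specify the fusion method. The names `GRUCell`, `encode`, the vector sizes, and the random weights are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Hypothetical, untrained GRU cell with randomly initialized weights."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # One (W, U, b) triple per gate: update z, reset r, candidate n.
        self.params = {
            gate: (
                init_matrix(hidden_size, input_size, rng),
                init_matrix(hidden_size, hidden_size, rng),
                [0.0] * hidden_size,
            )
            for gate in ("z", "r", "n")
        }

    def _affine(self, gate, x, h):
        W, U, b = self.params[gate]
        return [a + c + d for a, c, d in zip(matvec(W, x), matvec(U, h), b)]

    def step(self, x, h):
        z = [sigmoid(v) for v in self._affine("z", x, h)]   # update gate
        r = [sigmoid(v) for v in self._affine("r", x, h)]   # reset gate
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(v) for v in self._affine("n", x, reset_h)]  # candidate
        # Interpolate between the previous state and the candidate state.
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def encode(image_features, token_embeddings, cell):
    # Assumed fusion: append the same image feature vector to every token.
    h = [0.0] * cell.hidden_size
    for embedding in token_embeddings:
        h = cell.step(list(embedding) + list(image_features), h)
    return h


# Toy inputs: a 4-dimensional stand-in for VGG16 features and two 3-dimensional
# token embeddings derived from the recognized speech.
image_features = [0.2, 0.0, 0.7, 0.1]
tokens = [[0.1, -0.3, 0.5], [0.4, 0.2, -0.1]]
cell = GRUCell(input_size=3 + 4, hidden_size=5)
context = encode(image_features, tokens, cell)
```

In this sketch the resulting `context` vector depends on both the utterance and the image, which is the property the proposed model relies on; a decoder that turns this vector into a response is omitted.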