This work explores the application of deep-learning-based intelligent algorithms in human–computer interaction systems, with the aim of advancing such systems in the field of behavior recognition. First, the design scheme of the human–computer interaction system is presented, with emphasis on the construction of the robot visual positioning system. Then, the fast region-based convolutional neural network (fast-RCNN) algorithm is introduced and combined with the deep convolutional residual network ResNet101. A candidate region extraction algorithm based on ResNet and a long short-term memory network is proposed, together with a residual network (ResNet) with spatial context memory. Both algorithms are employed in the human–computer interaction system. Finally, the performance of the algorithms and of the human–computer interaction system is analyzed and characterized. The results show that the proposed candidate region extraction algorithm significantly reduces the loss on both the training set and the test set after training. In addition, the accuracy, recall, and F1-value of the model are all above 0.98, indicating good detection accuracy. The spatial context memory ResNet also performs well in speech expression detection: its detection accuracy for single-attribute, double-attribute, and multi-attribute speech expressions is above 89%. Overall, the human–computer interaction system performs well in capturing target objects; even for unlabeled objects, the grasping success rate is 95%. This work therefore provides a theoretical basis and reference for the application of intelligent optimization algorithms in human–computer interaction systems.
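
For readers unfamiliar with the detection metrics reported above, the following minimal Python sketch shows how such values are commonly computed from detection counts. It is an illustration only, not the authors' evaluation code: it assumes that the F-value is the standard F1 score and that "accuracy" for detection is meant as precision, and the function name `detection_metrics` and its arguments are hypothetical.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1 from detection counts.

    tp: true positives (correct detections)
    fp: false positives (spurious detections)
    fn: false negatives (missed targets)
    Assumption: the F-value is the standard F1 score, i.e. the
    harmonic mean of precision and recall.
    """
    # Guard against division by zero when there are no detections or no targets.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up counts, chosen only to illustrate the calculation.
    print(detection_metrics(tp=8, fp=2, fn=2))
```

Under this definition, an F1 value above 0.98 requires both precision and recall to be high, since the harmonic mean is pulled toward the smaller of the two.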