Video-based human motion recognition has attracted wide attention, and its results have been applied in intelligent human-computer interaction, virtual reality, intelligent monitoring, security, multimedia content analysis, and other areas. This study explores human action recognition in football scenes using learning-quality-related multimodal features. BN-Inception is selected as the underlying feature extraction network and is pretrained on the ImageNet dataset; the experiments use UCF101 and HMDB51, two datasets captured in uncontrolled, real-world environments. The spatial deep convolutional network takes image frames as input, and the temporal deep convolutional network takes stacked optical flow as input, so that human actions are recognized from multiple modalities. In the multimodal feature fusion results, accuracy on UCF101 is generally high: every result exceeds 80%, and the highest is 95.2%. Accuracy on HMDB51 is about 70%, and the lowest is only 56.3%. These results indicate that the proposed method achieves higher accuracy and better performance with multimodal features, and that single-modality feature recognition is significantly less accurate than multimodal feature recognition. The study provides an effective multimodal feature method for human motion recognition in football and other sports scenes.
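As an illustration of how the outputs of a spatial stream and a temporal stream might be combined, the following is a minimal sketch of late score fusion. It assumes a weighted average of per-class softmax scores, which is one common choice and is not stated as the fusion scheme of this study. The function names (`softmax`, `fuse_scores`, `predict`), the `temporal_weight` parameter, and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Weighted average of the two streams' class probabilities.

    spatial_logits:  class scores from the image-frame (spatial) stream
    temporal_logits: class scores from the stacked-optical-flow (temporal) stream
    temporal_weight: assumed mixing weight in [0, 1]; not taken from the study
    """
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= temporal_weight <= 1.0:
        raise ValueError("temporal_weight must lie in [0, 1]")
    spatial = softmax(spatial_logits)
    temporal = softmax(temporal_logits)
    w = temporal_weight
    return [(1.0 - w) * s + w * t for s, t in zip(spatial, temporal)]


def predict(fused_scores):
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)
```

Setting `temporal_weight` to 0 or 1 reduces the sketch to a single-modality classifier, which gives a simple way to compare single-stream and fused predictions on the same inputs.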