Sign language recognition technology can help people with hearing impairments communicate with hearing people. With the rapid development of society, deep learning has provided substantial technical support for sign language recognition. In sign language recognition tasks, traditional convolutional neural networks extract spatio-temporal features from sign language videos insufficiently, resulting in low recognition rates. Moreover, video-based sign language datasets are very large, requiring considerable computational resources for training while generalisation must still be ensured, which poses a further challenge for recognition. In this paper, we present a video-based sign language recognition method based on ResNet and LSTM. As the number of network layers increases, the ResNet architecture effectively mitigates the gradient vanishing and explosion problems and obtains better temporal features. We use a ResNet convolutional network as the backbone model. In the initialisation stage, we extract sign language features using ResNet; the learned feature space is then used as the input to an LSTM network to capture long-sequence features. Experimental results show that the accuracy of the proposed model exceeds that of mainstream models, and that it can effectively extract the spatio-temporal features in sign language videos and improve the recognition rate of sign language actions.
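To make the described pipeline concrete, below is a minimal sketch of the ResNet-plus-LSTM design the abstract outlines: a ResNet backbone extracts per-frame spatial features, and an LSTM models the resulting temporal sequence. PyTorch is assumed here since the paper names no framework, and all specifics (the ResNet-18 variant, hidden size, number of classes, clip length) are illustrative assumptions rather than values from the paper.

```python
# Sketch of a ResNet + LSTM sign language recogniser.
# Assumptions (not from the paper): PyTorch/torchvision, resnet18 backbone,
# hidden_size=512, single-layer LSTM, classification from the last time step.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class ResNetLSTM(nn.Module):
    def __init__(self, num_classes: int, hidden_size: int = 512):
        super().__init__()
        backbone = resnet18(weights=None)       # per-frame spatial extractor
        feat_dim = backbone.fc.in_features      # 512 for resnet18
        backbone.fc = nn.Identity()             # keep pooled features only
        self.backbone = backbone
        self.lstm = nn.LSTM(feat_dim, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, channels, height, width)
        b, t, c, h, w = video.shape
        frames = video.view(b * t, c, h, w)           # fold time into batch
        feats = self.backbone(frames).view(b, t, -1)  # (batch, time, feat_dim)
        seq_out, _ = self.lstm(feats)                 # temporal modelling
        return self.classifier(seq_out[:, -1])        # logits per sign class


if __name__ == "__main__":
    model = ResNetLSTM(num_classes=100)       # 100 sign classes, assumed
    clip = torch.randn(2, 16, 3, 224, 224)    # 2 clips of 16 frames, assumed
    print(model(clip).shape)                  # torch.Size([2, 100])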