The CNN-LSTM network has limited generalization ability and captures the backward temporal dependencies between actions only weakly. In this work, a convolutional autoencoder temporal network fused with an attention mechanism, namely, the convolutional block attention module (CBAM), is proposed. First, a convolutional autoencoder network is designed and pretrained to obtain lower-dimensional feature vectors. Second, batch normalization (BN) is used to speed up training and enhance the generalization ability of the network. Then, the encoder part of the pretrained convolutional autoencoder is reused, the attention mechanism is embedded to further emphasize the weights of the important parts of the image features, and a Bi-LSTM is added to form a CNN-Bi-LSTM network. Compared with the traditional CNN-LSTM model, the proposed method continuously expands the training samples through the pretrained network to improve generalization performance. The experimental results show that the proposed method effectively recognizes sign language videos, reaching a recognition rate of 89.90%, which is higher than that of the compared methods. These results verify the feasibility and effectiveness of the proposed method.
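As a rough illustration of the described pipeline, the sketch below (PyTorch assumed; all layer sizes, channel counts, and hyperparameters are illustrative placeholders, not the paper's exact configuration) shows how a pretrained convolutional encoder with BN, a CBAM block, and a Bi-LSTM could be composed into a CNN-Bi-LSTM video classifier.

```python
# Illustrative sketch only: encoder + CBAM + Bi-LSTM, with invented sizes.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional block attention module: channel then spatial attention."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Channel attention from average- and max-pooled descriptors.
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention from channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

class CNNBiLSTM(nn.Module):
    """Encoder half of a convolutional autoencoder (with BN) + CBAM + Bi-LSTM."""
    def __init__(self, num_classes, feat_dim=128, hidden=256):
        super().__init__()
        # In the paper this encoder is pretrained by reconstruction first;
        # here it is simply defined in place.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(feat_dim), nn.ReLU())
        self.cbam = CBAM(feat_dim)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, clips):                      # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        x = clips.flatten(0, 1)                    # fold time into the batch
        x = self.pool(self.cbam(self.encoder(x))).flatten(1)
        x = x.view(b, t, -1)                       # per-frame feature vectors
        out, _ = self.bilstm(x)                    # forward + backward context
        return self.fc(out[:, -1])                 # classify from the last step

model = CNNBiLSTM(num_classes=100)
logits = model(torch.randn(2, 16, 3, 112, 112))    # toy batch: 2 clips, 16 frames
```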