In practical applications of action pattern recognition based on surface electromyography (sEMG) signals, electrode displacement and the time-varying characteristics of signals acquired across days can reduce classification accuracy. In this paper, a 12-day forearm sEMG signal acquisition experiment was conducted, and a cross-time gesture recognition framework is proposed that is based on deep convolutional neural networks (CNNs) and uses sEMG signals and short-time Fourier transform (STFT) images as inputs. In single-day cross-validation, the recognition rates obtained with multiple CNN modules exceeded 90%. However, the average recognition rate on cross-day data was only 59.0%. In multi-day analysis, the classification performance of the framework improved significantly as the number of training days was gradually increased. In particular, 97.4% accuracy was achieved in the cross-time recognition task by using a specific configuration of DenseNet as the network module and extracting features with one-dimensional (1-D) convolution on signal fragments. Compared with extracting features from STFT images with two-dimensional (2-D) convolution, extracting signal features with 1-D convolution reduces the training time to about 1%, which is advantageous in terms of model performance.
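The two input representations compared above can be illustrated with a minimal sketch, assuming a multi-channel recording stored as a (channels, samples) array. The channel count, fragment length, step, FFT size, and hop below are placeholder values chosen for illustration, not the settings used in the experiment, and the helper names `sliding_windows` and `stft_magnitude` are hypothetical. Signal fragments would feed a 1-D convolutional module directly, whereas the STFT magnitude images would feed a 2-D convolutional module.

```python
import numpy as np

def sliding_windows(signal, win_len, step):
    """Split a (channels, samples) array into overlapping fragments.

    Returns an array of shape (n_windows, channels, win_len).
    """
    n_channels, n_samples = signal.shape
    starts = range(0, n_samples - win_len + 1, step)
    return np.stack([signal[:, s:s + win_len] for s in starts])

def stft_magnitude(fragment, n_fft, hop):
    """Hann-windowed STFT magnitude of each channel of one fragment.

    Returns an array of shape (channels, n_fft // 2 + 1, n_frames).
    """
    window = np.hanning(n_fft)
    n_channels, n_samples = fragment.shape
    starts = range(0, n_samples - n_fft + 1, hop)
    # frames has shape (channels, n_frames, n_fft)
    frames = np.stack([fragment[:, s:s + n_fft] * window for s in starts], axis=1)
    spectrum = np.fft.rfft(frames, axis=-1)
    # Reorder to (channels, frequency bins, time frames)
    return np.abs(spectrum).transpose(0, 2, 1)

# Placeholder data: 8 channels, 2000 samples of synthetic noise (assumed values).
rng = np.random.default_rng(0)
emg = rng.standard_normal((8, 2000))

# Input for a 1-D convolutional module: raw signal fragments.
fragments = sliding_windows(emg, win_len=200, step=100)

# Input for a 2-D convolutional module: one STFT image per channel and fragment.
images = np.stack([stft_magnitude(f, n_fft=64, hop=16) for f in fragments])

print(fragments.shape, images.shape)
```

The 1-D path skips the transform step and works on the fragments as they are, so it needs less preprocessing than the image path. This sketch does not attempt to reproduce the reported difference in training time.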