The maintenance cost, productivity, health, and safety of mechanical equipment depend heavily on the remaining useful life (RUL) of its bearings. Run-to-failure bearing datasets usually take the form of life-cycle sequences, which contain information about potential degradation failures over both the long and short term. Recently, the Transformer has been widely used for RUL prediction because it can capture part of the degradation information of a bearing. However, the Transformer is weak at acquiring local information and fails to extract temporal features from the degradation process. To address these problems, this paper proposes a spatio-temporal convolutional Transformer (STCT) model, which consists mainly of a dual convolutional spatio-temporal network (DCSTN) and a multi-scale Transformer (MST). The model not only captures the degradation features of bearings from both temporal and spatial perspectives but also enhances the Transformer's ability to acquire local information. DCSTN serves as the feature extraction module, and its spatio-temporal attention captures the relevant degradation-state features at different moments. In addition, MST uses a new module that combines multi-scale dilated causal convolution with multi-head attention, so that global degradation information and local contextual information are captured together. Comparative and ablation experiments on publicly available datasets demonstrate the effectiveness and advanced performance of the STCT model.
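To illustrate the multi-scale dilated causal convolution mentioned above, the following is a minimal NumPy sketch under stated assumptions. It is not the STCT implementation: the function names `dilated_causal_conv1d` and `multi_scale_features`, the kernel weights, and the dilation rates (1, 2, 4) are hypothetical choices made only for illustration, and the multi-head attention that MST combines with these convolutions is omitted. The sketch shows only the two defining properties: each output depends solely on current and past inputs, and larger dilation rates widen the temporal context.

```python
import numpy as np


def dilated_causal_conv1d(x, kernel, dilation):
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Inputs before the start of the sequence are treated as zero, so y[t]
    never depends on samples later than t (causality).
    """
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    pad = (len(kernel) - 1) * dilation          # left padding only
    padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for k, w in enumerate(kernel):
        start = pad - k * dilation              # shift back by k * dilation
        y += w * padded[start:start + len(x)]
    return y


def multi_scale_features(x, kernel, dilations=(1, 2, 4)):
    """Stack causal convolutions at several dilation rates (assumed values).

    Returns an array of shape (len(dilations), len(x)); each row sees a
    different temporal receptive field.
    """
    return np.stack([dilated_causal_conv1d(x, kernel, d) for d in dilations])


if __name__ == "__main__":
    signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy degradation sequence
    features = multi_scale_features(signal, kernel=[1.0, 1.0])
    print(features)
```

In a full model the rows of such a feature map would typically be learned with trainable kernels and then passed to an attention layer; here the weights are fixed so that the behavior can be checked by hand.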