Non-Intrusive Load Monitoring (NILM) is a technique for extracting the quantity, type, and operational status of appliances from the aggregate power data of a given area. Through energy disaggregation, total power consumption can be decomposed into device-level power consumption. In recent years, advances in deep learning have driven numerous improvements in energy disaggregation. However, in NILM tasks, large variations in device switching behavior can cause data imbalance, and devices operating in different states can degrade a model's prediction accuracy. To address these issues, this paper proposes a load prediction method based on a bidirectional Transformer and GRU, combined with time-aware self-attention (CTA-BERT). In this model, we leverage BERT's bidirectional Transformer to learn features at different positions in the power sequence, while a GRU is added to capture long-term sequence dependencies. Device time-state variables and dilated convolutional layers are introduced into the Transformer component to effectively capture long-sequence dependencies and global features. We also enhance the masking strategy to adapt to appliances in different operating states. Finally, we evaluated the proposed method against several state-of-the-art NILM algorithms on the REDD and UK-DALE datasets. The results show that the model's mean absolute error (MAE) is reduced by an average of 25% across all devices, the F1-score is improved by approximately 20%, and prediction accuracy is significantly enhanced.
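The abstract does not specify implementation details, so the following is only a minimal PyTorch sketch of the pipeline it describes (dilated convolution, bidirectional Transformer encoder, GRU, regression head); the time-state variables and the enhanced masking strategy are omitted, and all layer sizes, names, and hyperparameters are illustrative assumptions rather than the authors' code.

```python
# Minimal sketch of a CTA-BERT-style disaggregation model.
# All sizes and layer choices are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class NILMSketch(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        # Dilated convolution over the aggregate power sequence to widen
        # the receptive field and capture global features (assumed kernel/dilation).
        self.embed = nn.Conv1d(1, d_model, kernel_size=3, padding=2, dilation=2)
        # Transformer encoder: self-attention attends in both directions,
        # analogous to BERT's bidirectional encoding of the sequence.
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # GRU stacked on the Transformer features to capture long-term
        # sequential dependencies.
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        # Pointwise head regressing the target appliance's power at each step.
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):                    # x: (batch, seq_len) aggregate power
        h = self.embed(x.unsqueeze(1))       # (batch, d_model, seq_len)
        h = self.encoder(h.transpose(1, 2))  # (batch, seq_len, d_model)
        h, _ = self.gru(h)
        return self.head(h).squeeze(-1)      # (batch, seq_len) appliance power

# Example: disaggregate a batch of 480-step aggregate power windows.
model = NILMSketch()
y = model(torch.randn(8, 480))  # y.shape == (8, 480)
```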