The health of mechanical bearings directly affects the safe operation of equipment, so monitoring bearing condition is of crucial importance. Deep learning is currently the mainstream approach to this task. In practice, however, fault samples typically suffer from severe class imbalance, which makes conventional deep learning methods ill-suited to the problem. To address this issue, this paper proposes an invariant temporal-spatial attention fusion network, called ITSA-FN, for bearing fault diagnosis under imbalanced conditions. First, the proposed method uses an invariant temporal-spatial attention representation section, which consists of a pretrained convolutional auto-encoder model, a convolutional block attention module, and a long short-term memory network, to extract independent features and invariant features of the spatial-temporal characteristics of the input signals. Then, a multilayer perceptron fuses the extracted features and performs inference, and a new loss function derived from the focal loss is designed for network training. Finally, this paper validates the proposed model's effectiveness through comparative experiments, ablation studies, and generalization performance experiments.
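For orientation, the focal loss that the proposed training loss builds on can be sketched as follows. This is a minimal illustration of the standard multi-class focal loss, FL = -alpha_t (1 - p_t)^gamma log(p_t), and not the modified loss proposed in this paper, whose form is not given in the abstract. The function name `focal_loss`, the default `gamma=2.0`, and the optional per-class weights `alpha` are assumptions made for the example.

```python
import math

def focal_loss(logits, labels, gamma=2.0, alpha=None):
    """Mean standard focal loss over a batch (illustrative sketch only).

    logits: list of per-sample lists of raw class scores
    labels: list of integer class indices
    gamma:  focusing parameter; gamma = 0 reduces to cross-entropy
    alpha:  optional list of per-class weights (hypothetical parameter)
    """
    total = 0.0
    for row, y in zip(logits, labels):
        # Numerically stable log-softmax for the true class
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        log_pt = row[y] - log_z
        pt = math.exp(log_pt)
        # Down-weight well-classified samples (pt close to 1)
        weight = (1.0 - pt) ** gamma
        if alpha is not None:
            weight *= alpha[y]
        total += -weight * log_pt
    return total / len(labels)

# Usage: one confidently correct sample and one uncertain sample
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [0, 1]
ce = focal_loss(logits, labels, gamma=0.0)  # plain cross-entropy
fl = focal_loss(logits, labels, gamma=2.0)  # focal loss
```

With `gamma > 0`, easily classified samples contribute less to the loss, which is why focal-style losses are a common starting point when a few fault classes are heavily under-represented.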