In recent years, high-spatial-resolution remote sensing image scene classification has found a wide range of applications and has become a research hotspot in remote sensing. Because scenes in remote sensing images are complex, it is impossible to annotate all ground-object classes in advance. To adapt to different application scenarios, scene classification models for high-spatial-resolution remote sensing images therefore need zero-shot generalization ability for unseen classes. To improve this ability, existing methods usually work only from the perspective of image features and thus ignore the high-order semantic information in the scene. In fact, the associations among high-order semantic concepts in a scene are crucial to a classification model's generalization ability: humans typically combine image information with its corresponding high-order semantics to understand remote sensing scenes. This work therefore proposes a text-guided remote sensing image pre-training model for zero-shot classification of high-spatial-resolution remote sensing scenes. First, Transformer encoders are used to extract embedded features from the text and from the remote sensing images. Then, on aligned text-image pairs, a contrastive learning objective trains the model to learn the correspondence between text and image features. After training is complete, a nearest-neighbor method performs zero-shot classification on the target data. The effectiveness of the proposed method was verified on three remote sensing image scene classification benchmark datasets.
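To make the contrastive training step concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss over a batch of aligned image-text pairs. The abstract does not specify the exact loss formulation, so the symmetric cross-entropy form and the `temperature` hyperparameter are assumptions here, not the paper's confirmed implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of aligned image-text pairs
    (CLIP-style; an assumed instantiation of the contrastive objective).

    image_emb: (B, D) embeddings from the image Transformer encoder.
    text_emb:  (B, D) embeddings from the text Transformer encoder,
               where row i of text_emb describes row i of image_emb.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Pulling matched pairs together while pushing mismatched pairs apart is what lets a single shared embedding space serve both modalities at inference time.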
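The zero-shot step can then be sketched as a nearest-neighbor search in that shared space: each test image is assigned to the class whose text embedding is closest to the image embedding. The cosine-similarity metric and the prompt template are assumptions for illustration; the abstract states only that a nearest-neighbor method is used.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_text_emb):
    """Nearest-neighbor zero-shot classification in the shared space.

    image_emb:      (N, D) embeddings of test images.
    class_text_emb: (C, D) embeddings of one text description per class,
                    e.g. from a prompt such as "an aerial photograph of
                    a {class name}" (the prompt wording is hypothetical).
    Returns a (N,) tensor of predicted class indices.
    """
    # Normalize so the dot product is cosine similarity (assumed metric).
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)

    similarity = image_emb @ class_text_emb.t()  # (N, C)
    return similarity.argmax(dim=-1)
```

Because the class names of unseen categories can be embedded by the text encoder at test time, no image of an unseen class is needed during training, which is what gives the model its zero-shot generalization ability.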