In this paper, deep learning is applied to gain deeper insight into thermoacoustic instability and to achieve more reliable early warning. Flame images and pressure series are acquired in model combustors, and a total of seven data domains are obtained by varying the combustor structural parameters. A pre-trained model, TIPE (Thermoacoustic Image-Pressure Encoder), which comprises an image encoder with a ResNet architecture and a pressure encoder with a Transformer architecture, is then trained on a contrastive self-supervised task that aligns the image and pressure signals in the embedding space. Transfer learning for thermoacoustic instability prediction is subsequently performed based on k-nearest neighbors. Results show that the pre-trained model better resists the negative effects of class imbalance. Its weighted F1 score is 6.72% and 2.61% higher than that of supervised models in zero-shot transfer and few-shot transfer, respectively. It is inferred that the more generic features encoded by TIPE yield better generalization than traditional supervised methods. Moreover, the proposed method is insensitive to the thresholds used to determine thermoacoustic states. Principal component analysis provides a preliminary physical interpretation by linking the principal components of the features to pressure fluctuation amplitudes. Finally, the key spatial regions of the flame images and the key temporal intervals of the pressure series are visualized with class activation maps and global attention scores.
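
As a rough illustration of the two steps summarized above (contrastive alignment of paired image and pressure embeddings, followed by k-nearest-neighbor prediction on the frozen embeddings), the following minimal sketch may be helpful. It is not the TIPE implementation. The symmetric InfoNCE-style loss, the cosine similarity, the temperature of 0.07, the majority vote, and all function names are assumptions chosen for illustration, and the embeddings are toy vectors rather than encoder outputs.

```python
import math
from collections import Counter


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, pressure_emb, temperature=0.07):
    """Assumed InfoNCE-style loss: pair i (image i, pressure i) is the positive,
    every other pairing in the batch is a negative."""
    img = [l2_normalize(v) for v in image_emb]
    prs = [l2_normalize(v) for v in pressure_emb]
    logits = [
        [sum(a * b for a, b in zip(i_vec, p_vec)) / temperature for p_vec in prs]
        for i_vec in img
    ]
    logits_t = [list(col) for col in zip(*logits)]  # pressure-to-image direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


def knn_predict(query, bank, labels, k=3):
    """Majority vote among the k bank embeddings most similar to the query."""
    q = l2_normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, l2_normalize(v))), label)
        for v, label in zip(bank, labels)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy data: aligned pairs should give a lower loss than mismatched pairs.
aligned_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                           [[0.0, 1.0], [1.0, 0.0]])

# Hypothetical labeled embedding bank from a source domain.
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["stable", "stable", "unstable", "unstable"]
predicted = knn_predict([0.8, 0.2], bank, labels, k=3)
```

In this sketch the encoders are never updated during prediction: only the labeled embedding bank changes between domains, which is one plausible reading of how a k-nearest-neighbor head could support zero-shot and few-shot transfer.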