With the development of social media, the amount of fake news has risen significantly, affecting both individuals and society. Restrictions imposed by censorship further complicate the objective reporting of news. Most existing studies rely on supervised methods that require large amounts of labeled data for fake news detection, which limits their effectiveness. Moreover, these studies focus on detecting fake news in a single modality, either text or images, whereas real fake news more often appears as text–image pairs. In this paper, we introduce a self-supervised model grounded in contrastive learning. The model extracts features from text and images simultaneously and matches the two modalities via a dot product between their embeddings. Through contrastive learning, it strengthens image feature extraction, yielding robust visual representations with reduced training-data requirements. We evaluated the model against the baseline on the COSMOS fake news dataset. The experiments show that, when detecting fake news with mismatched text–image pairs, the model achieves 80% accuracy using only about 3% of the data for training, equivalent to 95% of the original model's performance when trained on the full dataset. Notably, replacing the text-encoding layer improves experimental stability, providing a substantial advantage over the original model on the COSMOS dataset.
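To make the dot-product image–text matching concrete, the following PyTorch sketch shows a symmetric, InfoNCE-style contrastive loss over normalized embeddings, in the spirit of the approach described above. The function name, temperature value, and overall setup are our illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumed, not the paper's implementation): dot-product
# text-image matching trained with a symmetric contrastive loss.
import torch
import torch.nn.functional as F

def contrastive_matching_loss(image_emb: torch.Tensor,
                              text_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) outputs of the two encoders."""
    # L2-normalize so the dot product becomes cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise dot-product similarity: logits[i, j] scores image i against caption j.
    logits = image_emb @ text_emb.t() / temperature

    # Matched text-image pairs lie on the diagonal; treat them as positives.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

At inference time, under this assumed setup, a low dot-product score between an image and its accompanying caption would flag the pair as mismatched, i.e., potentially fake.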