With the growth of social media, the internet, and sensing technologies, multimodal data are becoming increasingly common. Integrating these data into knowledge graphs helps models better understand and use these rich sources of information. Existing entity alignment methods for knowledge graphs typically extract features from different modalities, such as structure, text, attributes, and images, fuse them, and then compute the similarity between entities in different knowledge graphs from the fused features. However, the structures, attributes, images, and text descriptions of different knowledge graphs often differ significantly, so directly fusing the modalities can easily introduce noise and degrade alignment performance. To address this issue, this paper proposes a knowledge graph entity alignment method based on multimodal data supervision. First, a Transformer is used to obtain encoded representations of the knowledge graph entities. Then, multimodal supervision is applied when learning the entity representations, so that the entity vectors carry rich multimodal semantic information, which improves the generalization ability of the learned representations. Finally, the information from the different modalities is mapped into a shared low-dimensional subspace in which similar entities lie closer together, which improves entity alignment. In experiments on the DBP15K dataset, the proposed method achieves the best results among the compared methods, including MTransE, JAPE, EVA, and DNCN.
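The final step, mapping entities into a shared subspace and matching them by similarity, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the fixed linear projection stands in for the learned mapping, the Transformer encoder and the multimodal supervision are omitted, and all names (`project`, `align`, `hits_at_k`) and the toy data are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def project(features, weight):
    """Map entity features into a shared low-dimensional subspace.

    Assumption: a plain linear map stands in for the learned projection.
    """
    return l2_normalize(features @ weight)


def align(src_emb, tgt_emb):
    """Return, for each source entity, the index of the most similar target."""
    sim = src_emb @ tgt_emb.T  # cosine similarity matrix (rows are unit-norm)
    return sim.argmax(axis=1), sim


def hits_at_k(sim, gold, k=1):
    """Fraction of source entities whose gold target is among the top-k."""
    gold = np.asarray(gold)
    topk = np.argsort(-sim, axis=1)[:, :k]
    return float((topk == gold[:, None]).any(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: the target graph holds noisy, shuffled copies of the source entities.
    src_feats = rng.normal(size=(6, 16))
    perm = rng.permutation(6)
    tgt_feats = src_feats[perm] + 0.01 * rng.normal(size=(6, 16))
    gold = np.argsort(perm)  # gold[i] = row of tgt_feats that matches source i

    weight = rng.normal(size=(16, 8))  # hypothetical shared projection
    pred, sim = align(project(src_feats, weight), project(tgt_feats, weight))
    print("Hits@1:", hits_at_k(sim, gold, k=1))
```

Hits@k is used here because it is the standard way entity alignment results are reported; the sketch ranks candidates by cosine similarity, which is one common choice rather than a detail confirmed by the text above.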