In the digital era, remote teaching has become an integral part of the global education system. Effective remote teaching depends on highly interactive platforms and the precise delivery of teaching content, and multimodal image recognition technology plays a key role in both. By integrating visual and textual information, this technology raises the intelligence level of remote teaching platforms and provides teachers and students with a richer, more intuitive interactive experience. However, existing multimodal image recognition methods still fall short in accuracy, real-time performance, and semantic understanding; in complex teaching scenarios in particular, their understanding of and feedback on teaching content are not accurate enough, which limits the effectiveness of remote teaching interaction platforms. To address these limitations, this paper proposes a multimodal image alignment method based on a self-attention mechanism that integrates visual information into an encoder-decoder model to achieve high consistency between images and teaching content. In addition, a novel multimodal image annotation and recognition algorithm is introduced that considers both semantic information and visual saliency, yielding higher recognition accuracy and practicality. Experimental validation shows significant improvements in the accuracy and real-time performance of multimodal image recognition, providing strong technical support for remote teaching interaction platforms, optimizing the allocation of teaching resources, and enhancing the quality and efficiency of education.
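The core idea of attention-based multimodal alignment can be illustrated with a minimal sketch: text-side states attend over image-region features via scaled dot-product attention, so each text token absorbs visually aligned context before decoding. The dimensions, variable names, and random features below are purely illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable softmax over the key (image-region) axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Hypothetical sizes: 4 text tokens, 6 image regions, feature dimension 8
rng = np.random.default_rng(0)
d = 8
text_tokens = rng.normal(size=(4, d))    # decoder-side textual states (queries)
image_regions = rng.normal(size=(6, d))  # encoder-side visual features (keys/values)

# Cross-modal step: text queries attend over image regions,
# producing visually grounded text representations
fused, attn = scaled_dot_product_attention(text_tokens, image_regions, image_regions)
print(fused.shape)        # (4, 8): each text token now carries aligned visual context
print(attn.sum(axis=-1))  # each row of attention weights sums to 1
```

In a full encoder-decoder model this cross-attention would sit between learned projection layers and be applied per head; the sketch isolates only the alignment computation itself.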