As artificial intelligence advances rapidly, the Transformer architecture has become a pivotal model for handling multimodal data. This study investigates how multimodal large-scale models built on the Transformer architecture perform across a range of linguistic tasks and proposes optimization strategies tailored to this setting. Through a series of experiments, the study evaluates these models on multilingual datasets and analyzes the key factors that determine their effectiveness. First, several Transformer-based models, including ERNIE, GPT, ViT, and VisualBERT, are pre-trained on the same corpus and then tested on tasks in English, Chinese, Spanish, and other languages. A comparison of their results shows significant performance differences across languages. Building on this analysis and further experimental verification, the paper proposes a series of language-specific optimization strategies, including annotation methods for language-specific datasets, incremental fine-tuning, enlarging the training datasets, and multi-task learning. Experiments show that these methods yield substantial improvements, and the paper closes with directions for future research.
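The incremental fine-tuning strategy mentioned above can be sketched in a few lines. The snippet below is a minimal, illustrative PyTorch example, not the paper's actual setup: it uses a toy two-layer network as a stand-in for a pretrained Transformer (ERNIE, GPT, etc.), freezes the lower "pretrained" layer, and updates only the task head on a small batch standing in for language-specific data.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for a pretrained model: a frozen encoder layer plus a task head.
model = nn.Sequential(
    nn.Linear(16, 32),   # lower "pretrained" layer: frozen during incremental tuning
    nn.ReLU(),
    nn.Linear(32, 3),    # task head: updated on the new-language data
)

# Freeze the lower layer so only the head is trained.
for p in model[0].parameters():
    p.requires_grad = False

frozen_before = model[0].weight.clone()

# Hypothetical "new language" batch (random data, purely illustrative).
x = torch.randn(8, 16)
y = torch.randint(0, 3, (8,))

opt = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-2
)
loss_fn = nn.CrossEntropyLoss()

for _ in range(20):  # a few incremental fine-tuning steps
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

# The frozen layer is untouched; only the head adapted to the new data.
unchanged = torch.equal(frozen_before, model[0].weight)
print(unchanged)
```

In practice the same pattern applies to real checkpoints: freeze most of the pretrained backbone, then tune only the top layers (or the head) on each new language's dataset, which limits catastrophic forgetting of the original pretraining.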