Multimodal image-text retrieval aims to bridge the gap between visual and textual data, enabling efficient and accurate retrieval across the two modalities. Because manually labeled data are expensive, many researchers turn to low-quality multimodal data crawled in bulk from the web, which challenges a model's generalization performance and prediction accuracy. To address this issue, we build a multimodal image-text retrieval system based on the fusion of pre-trained models. First, we enrich the diversity of the original data with the MixGen augmentation algorithm to improve generalization. Next, we select Chinese-CLIP as the most suitable foundation model through comparative experiments among three candidate models. Finally, we assemble an ensemble of three base Chinese-CLIP models tailored to the characteristics of each task: a prediction-based fusion model for text-to-image retrieval and a feature-based fusion model for image-to-text retrieval. Extensive experiments show that our model outperforms state-of-the-art single foundation models in generalization, especially on low-quality image-text pairs and small datasets in the Chinese context.
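
To make the augmentation step concrete, the following is a minimal sketch of MixGen-style augmentation: images are mixed by pixel-wise linear interpolation while the paired captions are concatenated, following the published MixGen recipe. The function name, the batch-permutation pairing strategy, and the fixed mixing coefficient of 0.5 are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def mixgen(images: torch.Tensor, captions: list[str], lam: float = 0.5):
    """MixGen-style augmentation for a batch of image-text pairs.

    images:   (B, C, H, W) float tensor
    captions: list of B caption strings
    lam:      image mixing coefficient (0.5 in the original MixGen paper)
    """
    # Pair each sample with a randomly chosen partner from the same batch.
    perm = torch.randperm(images.size(0))
    # Images are interpolated pixel-wise ...
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    # ... while the two captions are simply concatenated into one string.
    mixed_captions = [
        cap + " " + captions[j] for cap, j in zip(captions, perm.tolist())
    ]
    return mixed_images, mixed_captions
```

Each synthetic pair remains semantically aligned, since the mixed image still contains content described by both concatenated captions; this is what lets the augmentation raise data diversity without new annotation cost.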
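
The two fusion schemes can likewise be sketched. Below, prediction-based fusion (text-to-image) averages each base model's similarity scores after retrieval scoring, while feature-based fusion (image-to-text) averages the normalized embeddings before scoring. This assumes each base model exposes `encode_image`/`encode_text` in the style of the open-source Chinese-CLIP package, that queries are already tokenized, and that all models share an embedding dimension; the uniform averaging weights are an assumption, since the abstract does not specify them.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prediction_fusion(models, image_feats_per_model, text_query) -> torch.Tensor:
    """Text-to-image: late fusion by averaging per-model similarity scores.

    models:                three base Chinese-CLIP models
    image_feats_per_model: one (N, D) L2-normalized gallery matrix per model
    text_query:            tokenized text batch of shape (B, L)
    """
    scores = []
    for model, gallery in zip(models, image_feats_per_model):
        q = F.normalize(model.encode_text(text_query), dim=-1)
        scores.append(q @ gallery.T)         # (B, N) similarities per model
    return torch.stack(scores).mean(dim=0)   # fused (B, N) ranking scores

@torch.no_grad()
def feature_fusion(models, image) -> torch.Tensor:
    """Image-to-text: early fusion by averaging normalized image embeddings."""
    feats = [F.normalize(m.encode_image(image), dim=-1) for m in models]
    # Re-normalize so the fused embedding is directly comparable by cosine.
    return F.normalize(torch.stack(feats).mean(dim=0), dim=-1)
```

Averaging scores suits text-to-image retrieval because each model ranks the same fixed image gallery, whereas averaging features suits image-to-text retrieval by producing a single query embedding that can be scored against any text index.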