In agricultural production, pest problems are inevitable. In recent years, China has suffered annual grain losses of up to 40 million tons due to various pests and diseases. There are over one million known insect species in the natural world, exhibiting complex and diverse morphologies, which makes manual identification costly. With the advancement of deep learning, methods that rely solely on image data for crop pest identification have achieved some success. However, these methods depend heavily on large quantities of high-quality annotated images. Many existing approaches overlook the value of modalities beyond images and rely only on low-level image features for recognition; they fail to fully exploit the semantic correlations between multimodal data, which limits the reliability and interpretability of their results. Moreover, in real-world scenarios pest habitats are complex and diverse, and pests appear at varying scales, placing high demands on models in practical applications. To address these challenges, this paper proposes a Multi-Scale Cross-Modal Feature Fusion Model (ITFNet-API) that uses multi-scale visual features and textual features for cross-modal attention learning. A super-resolution reconstruction technique is introduced to restore high-frequency information in low-quality images; to avoid altering low-frequency information during reconstruction, the reconstructed high-quality images are combined with the original images and fed into the model to enrich the learned pest features. The proposed Image and Text Fusion Module (ITFM) performs multi-scale cross-modal information fusion, and an Inverse Transpose Convolution (ITC) network module is introduced to restore multi-scale visual features. The resulting visual features are then input into the neck and head networks of the YOLOv4 (You Only Look Once version 4) model to obtain the category and location of pests in each image.
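The cross-modal attention step at the heart of such a fusion module can be sketched as scaled dot-product attention in which textual features attend over multi-scale visual features. The shapes, names, and single-head formulation below are illustrative assumptions for exposition, not the paper's actual ITFM implementation.

```python
import numpy as np

def cross_modal_attention(text_feats, visual_feats):
    """Textual queries attend over visual keys/values (illustrative sketch).

    text_feats:   (T, d) token-level text features, used as queries
    visual_feats: (N, d) flattened multi-scale visual features, used as keys/values
    returns:      (T, d) text features enriched with visual context
    """
    d = text_feats.shape[-1]
    scores = text_feats @ visual_feats.T / np.sqrt(d)   # (T, N) similarity
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)            # softmax over visual positions
    return attn @ visual_feats                          # weighted sum of visual features

# toy example: 4 text tokens, 16 visual positions, 8-dim features
rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))
visual = rng.standard_normal((16, 8))
fused = cross_modal_attention(text, visual)
print(fused.shape)  # (4, 8)
```

In a full model this operation would run per feature-map scale (with learned query/key/value projections), so that each level of the visual pyramid is modulated by the same textual description.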
Finally, to verify the model's practical applicability, this paper introduces an Arbitrary Combination Image Enhancement (ACIE) data augmentation method, which generates batches of complex multi-target, multi-scale pest images in challenging scenarios for data augmentation and model training. We conducted extensive comparative experiments on the IP102 dataset to validate the effectiveness of the proposed method. The experimental results show that our method achieved an accuracy of 82.15% and an AP50 of 92.18% in pest recognition, significantly outperforming other advanced methods. In multi-target recognition, it reached an accuracy of 45.24% with an AP50 of 43.04%.
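Combination-style augmentation of this kind can be sketched as tiling several single-target images into one multi-target image and shifting their bounding boxes into the corresponding quadrants. This is a simplified, hypothetical sketch (a fixed 2x2 layout with equally sized inputs); ACIE's actual combination procedure may differ.

```python
import numpy as np

def combine_images_2x2(images, boxes_per_image):
    """Tile four equally sized images into a 2x2 mosaic and shift their
    bounding boxes accordingly (illustrative sketch of combination-style
    multi-target augmentation).

    images:          list of 4 arrays, each (H, W, 3)
    boxes_per_image: list of 4 arrays, each (n_i, 4) as [x1, y1, x2, y2]
    returns:         mosaic image (2H, 2W, 3) and combined boxes (sum n_i, 4)
    """
    h, w = images[0].shape[:2]
    mosaic = np.zeros((2 * h, 2 * w, 3), dtype=images[0].dtype)
    offsets = [(0, 0), (0, w), (h, 0), (h, w)]  # (row, col) origin of each quadrant
    all_boxes = []
    for img, boxes, (dy, dx) in zip(images, boxes_per_image, offsets):
        mosaic[dy:dy + h, dx:dx + w] = img
        shifted = boxes + np.array([dx, dy, dx, dy])  # move boxes into quadrant
        all_boxes.append(shifted)
    return mosaic, np.concatenate(all_boxes, axis=0)

# toy example: four 100x100 images, one box each
imgs = [np.full((100, 100, 3), i, dtype=np.uint8) for i in range(4)]
boxes = [np.array([[10, 10, 50, 50]]) for _ in range(4)]
mosaic, mboxes = combine_images_2x2(imgs, boxes)
print(mosaic.shape, mboxes.shape)  # (200, 200, 3) (4, 4)
```

A production version would additionally randomize the number, scale, and placement of the source images so that the synthesized scenes vary in target count and target size, which is what stresses multi-target, multi-scale detection.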