In this study, an innovative approach based on multimodal data and the transformer model was proposed to address challenges in agricultural disease detection and question-answering systems. This method effectively integrates image, text, and sensor data, utilizing deep learning technologies to profoundly analyze and process complex agriculture-related issues. The study achieved technical breakthroughs and provides new perspectives and tools for the development of intelligent agriculture. In the task of agricultural disease detection, the proposed method demonstrated outstanding performance, achieving a precision, recall, and accuracy of 0.95, 0.92, and 0.94, respectively, significantly outperforming the other conventional deep learning models. These results indicate the method’s effectiveness in identifying and accurately classifying various agricultural diseases, particularly excelling in handling subtle features and complex data. In the task of generating descriptive text from agricultural images, the method also exhibited impressive performance, with a precision, recall, and accuracy of 0.92, 0.88, and 0.91, respectively. This demonstrates that the method can not only deeply understand the content of agricultural images but also generate accurate and rich descriptive texts. The object detection experiment further validated the effectiveness of our approach, where the method achieved a precision, recall, and accuracy of 0.96, 0.91, and 0.94. This achievement highlights the method’s capability for accurately locating and identifying agricultural targets, especially in complex environments. Overall, the approach in this study not only demonstrated exceptional performance in multiple tasks such as agricultural disease detection, image captioning, and object detection but also showcased the immense potential of multimodal data and deep learning technologies in the application of intelligent agriculture.