As urbanization accelerates in agricultural areas and land-use patterns continue to change, the conversion of agricultural land has become increasingly complex and dynamic, raising the demands on precise monitoring. Most existing monitoring methods are constrained by limited spatial and temporal resolution, high computational cost, and difficulty distinguishing complex land cover types. These limitations hinder the detection of rapid and subtle land-use changes, particularly in areas undergoing rapid urban expansion. To address these challenges, this study presents a multimodal deep learning framework built on a temporal semantic segmentation change detection (TSSCD) model optimized with ant colony optimization (ACO) to detect and analyze agricultural land conversion in Zhengzhou City, a major grain-producing area in China. The model uses Landsat 7/8 and Sentinel-2 satellite imagery from 2003 to 2023 to capture the spatiotemporal transformation of cropland driven by urban expansion, infrastructure development, and population change over the last two decades. The optimized TSSCD model achieves higher classification accuracy, with the kappa coefficient rising from 0.871 to 0.892, the spatial F1 score from 0.903 to 0.935, and the temporal F1 score from 0.848 to 0.879, indicating its effectiveness in identifying complex land-use changes. The TSSCD results reveal pronounced spatiotemporal variation in agricultural land conversion in Zhengzhou City from 2003 to 2023: conversion was initially concentrated near Zhengzhou’s urban core and then expanded outward, particularly to the east and north. These results highlight the effectiveness of remote sensing and deep learning techniques for monitoring agricultural land conversion.
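For readers less familiar with the accuracy measures reported above, the following is a minimal sketch of how a kappa coefficient and a per-class F1 score can be computed from a confusion matrix. It is not the study's evaluation code: the function names, the three class labels, and the toy label sequences are assumptions made purely for illustration, and the exact definitions of the spatial and temporal F1 variants are not given in this abstract.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     n_classes: int) -> List[List[int]]:
    """Rows index the reference class, columns index the predicted class."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m


def cohen_kappa(m: List[List[int]]) -> float:
    """Cohen's kappa: agreement corrected for chance agreement."""
    k = len(m)
    n = sum(sum(row) for row in m)
    observed = sum(m[i][i] for i in range(k)) / n
    # Chance agreement from the row (reference) and column (predicted) totals.
    expected = sum(
        sum(m[i]) * sum(row[i] for row in m) for i in range(k)
    ) / (n * n)
    return (observed - expected) / (1 - expected)


def f1_score(m: List[List[int]], cls: int) -> float:
    """F1 for one class: harmonic mean of precision and recall."""
    tp = m[cls][cls]
    fp = sum(row[cls] for row in m) - tp
    fn = sum(m[cls]) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels: 0 = cropland, 1 = built-up, 2 = water (toy data only).
y_true = [0, 0, 1, 1, 2, 2, 1, 0]
y_pred = [0, 0, 1, 2, 2, 2, 1, 1]

matrix = confusion_matrix(y_true, y_pred, n_classes=3)
print("kappa:", round(cohen_kappa(matrix), 3))
print("F1 (cropland):", round(f1_score(matrix, 0), 3))
```

Kappa discounts the agreement that would be expected by chance, so it is a stricter measure than overall accuracy when class frequencies are unbalanced, as is common in land cover maps.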