Forest fires, as severe natural disasters, pose significant threats to ecosystems and human societies, and their spread evolves continuously over both time and space. This complexity makes predicting the course of forest fire spread an immense challenge. Traditional methods of forest fire spread prediction are limited in their ability to process multidimensional fire-related data, particularly in integrating spatiotemporal information. To address these limitations and improve the accuracy of forest fire spread prediction, we propose the AutoST-Net model. This innovative encoder–decoder architecture combines a three-dimensional Convolutional Neural Network (3DCNN) with a transformer to effectively capture the local and global spatiotemporal dynamics of forest fire spread, and incorporates a specially designed attention mechanism to improve predictive precision. Additionally, to effectively guide firefighting work in the southwestern forest regions of China, we constructed a forest fire spread dataset covering forest fire status, weather conditions, terrain features, and vegetation status, built from Google Earth Engine (GEE) and Himawari-8 satellite data. On this dataset, AutoST-Net outperforms a combined CNN-LSTM model by 5.06% in MIoU and 6.29% in F1-score. These results demonstrate the superior performance of AutoST-Net in predicting forest fire spread from remote sensing images.
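To make the 3DCNN-plus-transformer encoder–decoder idea concrete, the following is a minimal PyTorch sketch of that general design. It is an illustrative assumption, not the paper's released AutoST-Net code: all module names, channel counts, and the 8-channel input layout (fire state, weather, terrain, vegetation stacked as bands) are hypothetical choices made here for demonstration.

```python
import torch
import torch.nn as nn

class SpatioTemporalEncoder(nn.Module):
    """3D convolutions capture local spatiotemporal features; a transformer
    then models global dependencies across all space-time tokens (illustrative)."""
    def __init__(self, in_channels=8, embed_dim=64, num_heads=4, depth=2):
        super().__init__()
        # 3DCNN stem: (B, C, T, H, W) -> (B, embed_dim, T, H/4, W/4)
        self.cnn3d = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, embed_dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                      # x: (B, C, T, H, W)
        f = self.cnn3d(x)                      # (B, D, T, h, w)
        b, d, t, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, T*h*w, D) token sequence
        tokens = self.transformer(tokens)      # global spatiotemporal attention
        return tokens.transpose(1, 2).reshape(b, d, t, h, w)

class FireSpreadNet(nn.Module):
    """Encoder-decoder: predicts the next-step burned-area mask (hypothetical)."""
    def __init__(self, in_channels=8, embed_dim=64):
        super().__init__()
        self.encoder = SpatioTemporalEncoder(in_channels, embed_dim)
        # Decoder: pool the time axis, then upsample back to input resolution
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, 32, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, kernel_size=1),   # per-pixel fire/no-fire logit
        )

    def forward(self, x):
        f = self.encoder(x).mean(dim=2)        # average over time steps
        return self.decoder(f)                 # (B, 1, H, W) logits

model = FireSpreadNet(in_channels=8)
# Batch of two 4-step input sequences of 64x64 multiband rasters
frames = torch.randn(2, 8, 4, 64, 64)
print(model(frames).shape)                     # torch.Size([2, 1, 64, 64])
```

The division of labor sketched here reflects the abstract's stated design: the 3D convolutions extract local spatiotemporal features within small neighborhoods of the input sequence, while the transformer's self-attention relates every space-time position to every other, capturing the global spread dynamics before the decoder maps back to a per-pixel fire mask.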