Accurate vehicle trajectory prediction is crucial for motion planning in autonomous driving. Some existing trajectory prediction models use Long Short-Term Memory (LSTM) networks to encode trajectory information, but this encoding can lose temporal information. In addition, the sparsity of grid-based representations may limit a model’s ability to extract vehicle interaction features. To address these issues, this paper proposes a vehicle trajectory prediction model based on cross-attention and multilevel spatio-temporal features. The model includes a spatio-temporal feature extraction module that captures the dynamic evolution of vehicle states across both time and space. A temporal cross-attention module reuses hidden states to select the spatio-temporal features most closely related to the target vehicle’s historical state, compensating for the temporal information lost during encoding. A spatial cross-attention module, informed by social context, analyzes the dynamic interactions among vehicles and filters the spatio-temporal interaction features that most strongly affect the target vehicle’s trajectory. The model is trained and tested on the NGSIM and highD datasets. Compared with other models, it achieves higher accuracy in long-term trajectory prediction.
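
To make the temporal cross-attention idea concrete, the following is a minimal sketch of generic scaled dot-product attention over a sequence of per-time-step hidden states. It is not the module proposed in the paper: the function names, the toy hidden states, and the choice of the last hidden state as the query are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    query:  list[float], e.g. a summary of the target vehicle's history
    keys:   list[list[float]], one vector per time step
    values: list[list[float]], one vector per time step
    Returns the attention-weighted context vector and the weights.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    context = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(dim)
    ]
    return context, weights


# Hypothetical hidden states for three time steps (toy values, not real data).
hidden_states = [[0.1, 0.0], [0.4, 0.2], [0.9, 0.5]]

# Assumption: the last hidden state serves as the query, and all hidden
# states are reused as keys and values instead of being discarded.
query = hidden_states[-1]
context, weights = cross_attention(query, hidden_states, hidden_states)
```

In this toy setup the weights sum to one, and time steps whose hidden state aligns more closely with the query receive more weight, so earlier time steps still contribute to the context vector rather than being represented only through the final encoder state.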