Machine learning techniques are now widely used in structural seismic response studies, and the network models developed for various types of seismic response offer new ways to analyse seismic hazards. However, existing network models struggle to balance input applicability, accuracy, and computational efficiency. This paper presents a neural network model containing an efficient self-adaptive feature extraction module (AFEM). The module recognizes time-frequency features in ground motion (GM) inputs for structural seismic response prediction tasks while balancing the model’s accuracy against its computational cost. AFEM is constructed on the basis of the Mel-frequency cepstral coefficient (MFCC) feature extraction process used in natural language processing (NLP), and it recognizes time-frequency features closely related to the behaviour and response of structures under dynamic loads. Taking the seismic response prediction of a typical building as the target task, the neural network configurations, comprising a baseline model M0 and three comparison models with AFEM (M1, M2, and M3), are systematically analysed. The results demonstrate that the proposed M1 model with the initial AFEM, the M2 model with combined amplitude and phase features, and the M3 model with a complex-valued network are all better suited to the target task than the baseline model. The amplitude and phase features extracted by the M3 model’s AFEM improve model validation accuracy by 8.6% while reducing computation time by 11.4%. These results could provide a basis for future research on intelligent regional earthquake damage assessment systems.
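
To make the idea of amplitude and phase time-frequency features concrete, the following is a minimal sketch of a short-time Fourier transform applied to a ground-motion-like record. It is not the AFEM described above and not a full MFCC pipeline (there is no Mel filter bank or cepstral step). The function name `stft_amplitude_phase`, the frame length, the hop size, the sampling rate, and the synthetic decaying sinusoid are all assumptions chosen for illustration.

```python
import numpy as np


def stft_amplitude_phase(signal, frame_len=256, hop=128):
    """Return (amplitude, phase) of Hann-windowed frames of a 1-D record.

    Hypothetical helper for illustration; frame_len and hop are assumed values.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    # Number of full frames that fit into the record
    n_frames = 1 + (signal.size - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One-sided spectrum per frame: shape (n_frames, frame_len // 2 + 1)
    spectrum = np.fft.rfft(frames, axis=1)
    return np.abs(spectrum), np.angle(spectrum)


# Synthetic stand-in for a ground-motion record (assumed: 100 Hz sampling, 20 s)
fs = 100.0
t = np.arange(2000) / fs
accel = np.exp(-0.2 * t) * np.sin(2.0 * np.pi * 2.0 * t)  # decaying 2 Hz motion

amplitude, phase = stft_amplitude_phase(accel)
# Stack amplitude and phase as two channels of one time-frequency feature map
features = np.stack([amplitude, phase], axis=-1)
print(features.shape)
```

Stacking the two arrays as channels is one simple way to feed both kinds of feature to a real-valued network; a complex-valued network could instead consume the complex spectrum directly.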