Predicting structural response under earthquakes is crucial for evaluating structural performance and planning subsequent functional restoration. Deep learning offers the potential to obtain these responses rapidly by bypassing time-consuming nonlinear finite element analysis. However, a single deep learning network can typically predict the time history responses of only one specific structure, so modeling different structures requires building multiple networks, which leads to redundancy and wasted resources. This study therefore proposes a Structure Temporal Fusion Network (STFN) that can predict the responses of various homogeneous structures using a single network. The key concept is that the seismic waves are fused with static structural characteristics, such as the number of stories, to predict diverse time history responses. Two numerical experiments are conducted: predicting the responses of ideal single-degree-of-freedom (SDOF) structures and of regular multistory reinforced concrete frames. Furthermore, a series of ablation analyses is carried out to validate the network architecture. The results indicate that STFN can predict nonlinear time history responses of different structures with mean square errors in the magnitude of and for two experiments, respectively. The results also highlight the importance of fusing static characteristics when modeling various structures with only one network. The STFN presents a promising solution for time history response prediction across multiple structures at the regional scale.
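The idea of fusing a seismic record with static structural characteristics can be illustrated with a minimal sketch. The abstract does not specify how STFN performs the fusion, so the code below assumes the simplest possible scheme (repeating the static features at every time step and concatenating them with the ground-motion sample); the function names `fuse_static_with_sequence` and `mean_square_error` and the example feature values are hypothetical and are not taken from the study.

```python
from typing import List, Sequence


def fuse_static_with_sequence(
    ground_motion: Sequence[float],
    static_features: Sequence[float],
) -> List[List[float]]:
    """Build per-time-step input vectors for a sequence model.

    ASSUMPTION: early fusion by concatenation. Each time step receives the
    ground-motion sample followed by the same static features (for example,
    the number of stories). The actual STFN fusion mechanism may differ.
    """
    return [[sample, *static_features] for sample in ground_motion]


def mean_square_error(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean square error between a predicted and a reference time history."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if len(predicted) == 0:
        raise ValueError("time histories must not be empty")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


if __name__ == "__main__":
    # Hypothetical inputs: three acceleration samples and two static
    # characteristics (story count and a fundamental period in seconds).
    fused = fuse_static_with_sequence([0.1, -0.2, 0.05], [5.0, 0.8])
    for step in fused:
        print(step)
```

With this layout, one network can in principle be trained on records from many structures, because the static features tell it which structure each time step belongs to. The mean square error helper corresponds to the accuracy metric reported in the abstract.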