To address the interdependence of local time-frequency information in audio scene recognition, a cross-attention-based method for segment-level time-frequency feature fusion is proposed. Since audio scene recognition is highly sensitive to individual sound events within a scene, the input features are divided into multiple segments along the time dimension to obtain local features, allowing the subsequent attention mechanism to focus on the time slices containing key sound events. Furthermore, to leverage the complementary strengths of convolutional neural networks (CNNs) and recurrent neural networks (RNNs), the two mainstream architectures in audio scene recognition, this paper adopts a symmetric structure that separately extracts the time-frequency features produced by the CNN and RNN branches and then fuses the two sets of features using cross-attention. Experiments on the TUT2018, TAU2019, and TAU2020 datasets demonstrate that the proposed method improves on the official baseline results by 17.78%, 15.95%, and 20.13%, respectively.
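To make the overall pipeline concrete, the following is a minimal PyTorch sketch of the idea described above, not the paper's exact architecture: the segment length, channel widths, GRU configuration, and the choice of CNN features as queries and RNN features as keys/values in the cross-attention are illustrative assumptions.

```python
# Minimal sketch (assumptions: segment length, layer sizes, query/key roles).
import torch
import torch.nn as nn


class SegmentCrossAttentionFusion(nn.Module):
    """Split a log-mel spectrogram into time segments, encode each segment with a
    CNN branch and an RNN (GRU) branch, then fuse the two feature sets with
    cross-attention over the segment dimension."""

    def __init__(self, n_mels=64, seg_len=32, d_model=128, n_heads=4, n_classes=10):
        super().__init__()
        self.seg_len = seg_len
        # CNN branch: small conv stack, pooled to one vector per segment.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, d_model, 3, padding=1), nn.BatchNorm2d(d_model), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # RNN branch: a GRU over the frames of each segment.
        self.rnn = nn.GRU(n_mels, d_model, batch_first=True)
        # Cross-attention: CNN segment features attend to RNN segment features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, x):
        # x: (batch, n_mels, time); crop so time is a multiple of seg_len.
        b, m, t = x.shape
        n_seg = t // self.seg_len
        x = x[:, :, : n_seg * self.seg_len]
        # Reshape to (batch * n_seg, n_mels, seg_len): each segment is encoded independently.
        segs = x.reshape(b, m, n_seg, self.seg_len).permute(0, 2, 1, 3)
        segs = segs.reshape(b * n_seg, m, self.seg_len)

        # CNN branch -> (batch, n_seg, d_model)
        cnn_feat = self.cnn(segs.unsqueeze(1)).flatten(1).reshape(b, n_seg, -1)

        # RNN branch: last hidden state per segment -> (batch, n_seg, d_model)
        _, h = self.rnn(segs.transpose(1, 2))
        rnn_feat = h[-1].reshape(b, n_seg, -1)

        # Cross-attention fusion across segments, then pool and classify the scene.
        fused, _ = self.cross_attn(query=cnn_feat, key=rnn_feat, value=rnn_feat)
        return self.classifier(fused.mean(dim=1))


if __name__ == "__main__":
    model = SegmentCrossAttentionFusion()
    logits = model(torch.randn(2, 64, 320))  # 2 clips, 64 mel bins, 320 frames
    print(logits.shape)                      # torch.Size([2, 10])
```

Segmenting before attention lets each query attend to segment-level summaries rather than individual frames, which is one way the mechanism can weight time slices that contain salient sound events.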