The presence of a large amount of echoes significantly impairs the quality and intelligibility of speech during communication. To address this issue, numerous studies and models have been conducted to cancel echo. In this study, we propose a multi-stage acoustic echo cancellation model that utilizes an adaptive filter and a deep neural network. Our model consists of two parts: the Speex algorithm for canceling linear echo, and the multi-scale time-frequency UNet (MSTFUNet) for further echo cancellation. The Speex algorithm takes the far-end reference speech and the near-end microphone signal as inputs, and outputs the signal after linear echo cancellation. MSTFUNet takes the spectra of the far-end reference speech, the near-end microphone signal, and the output of Speex as inputs, and generates the estimated near-end speech spectrum as output. To enhance the performance of the Speex algorithm, we conduct delay estimation and compensation to the far-end reference speech. For MSTFUNet, we employ multi-scale time-frequency processing to extract information from the input spectrum. Additionally, we incorporate an improved time-frequency self-attention to capture time-frequency information. Furthermore, we introduce channel time-frequency attention to alleviate information loss during downsampling and upsampling. In our experiments, we evaluate the performance of our proposed model on both our test set and the blind test set of the Acoustic Echo Cancellation challenge. Our proposed model exhibits superior performance in terms of acoustic echo cancellation and noise reverberation suppression compared to other models.