Objective. Various artifacts in electroencephalography (EEG) signals pose a major obstacle to the real-life deployment of brain–computer interfaces. Recently, deep learning-based EEG denoising methods have shown excellent performance. However, existing network designs inadequately exploit inter-channel relationships when processing multi-channel EEG signals: most methods denoise each channel independently. Given that EEG channels are correlated during the same brain activity, this paper proposes exploiting channel relationships to enhance denoising performance. Approach. We explicitly model inter-channel relationships with a self-attention mechanism, hypothesizing that these correlations can support and improve the denoising process. Specifically, we introduce a novel denoising network, the spatial-temporal fusion network (STFNet), which integrates stacked multi-dimension feature extractors to capture both temporal dependencies and spatial relationships. Main results. The proposed network exhibits superior denoising performance, reducing relative root mean squared error by 24.27% compared with other methods on a public benchmark. STFNet also proves effective in cross-dataset denoising and downstream classification tasks, improving accuracy by 1.40%, while offering fast processing on CPU. Significance. The experimental results demonstrate the importance of integrating spatial and temporal characteristics. The computational efficiency of STFNet makes it suitable for real-time applications and a potential tool for deployment in realistic environments.
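
To illustrate the core idea of modeling inter-channel relationships with self-attention, the following is a minimal NumPy sketch, not the actual STFNet architecture. It applies scaled dot-product attention across EEG channels so that each channel's representation is a weighted mixture of all channels; the function name, the random projection weights standing in for learned parameters, and the choice of the sample axis as the feature dimension are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_self_attention(eeg, seed=0):
    """Toy self-attention across EEG channels (illustrative, not STFNet).

    eeg: array of shape (n_channels, n_samples). Each channel attends to
    every other channel, so correlated channels can inform each other's
    representation. Projection weights are random stand-ins for the
    parameters a real network would learn.
    """
    rng = np.random.default_rng(seed)
    c, d = eeg.shape  # treat the sample axis as the feature dimension
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = eeg @ Wq, eeg @ Wk, eeg @ Wv
    # (c, c) matrix of channel-to-channel attention weights; rows sum to 1.
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)
    return attn @ V, attn

# Hypothetical input: 4 channels, 32 time samples.
eeg = np.random.default_rng(1).standard_normal((4, 32))
out, attn = channel_self_attention(eeg)
```

The attention matrix makes the inter-channel dependencies explicit: entry (i, j) is the weight channel i places on channel j, which is the kind of spatial relationship the paper argues a denoiser should exploit alongside temporal structure.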