Digital audio tampering detection can be used to verify the authenticity of digital audio. However, most current methods rely on visually comparing the continuity of the electric network frequency (ENF) embedded in the audio against a reference ENF database. Such a database is often difficult to obtain, and features derived from visual inspection have weak expressive power, which leads to low detection accuracy. To address this problem, this paper proposes an audio tampering detection method based on the fusion of shallow and deep features. First, band-pass filtering is applied to the audio signal to obtain the ENF component, and the discrete Fourier transform and the Hilbert transform are then applied to estimate the phase and instantaneous frequency of the ENF component. Second, shallow features are extracted by framing and fitting the estimated phase and instantaneous frequency. Third, a purpose-designed convolutional neural network extracts deep features, and an attention mechanism fuses the shallow and deep features. Finally, after dimensionality reduction through a fully connected layer, a Softmax layer classifies the audio as tampered or untampered. The method achieves 97.03% accuracy on three classic databases (Carioca 1, Carioca 2, and New Spanish) and 88.31% accuracy on the newly constructed GAUDI-DI database. Experimental results show that the proposed method outperforms state-of-the-art methods.
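As a rough illustration of the first step only, the sketch below isolates a narrow band around the nominal mains frequency and takes the analytic signal to read off instantaneous phase and frequency. It is not the authors' implementation: the function name `enf_phase_and_frequency`, the 50 Hz nominal frequency, the 0.5 Hz half-bandwidth, and the use of an FFT-domain mask in place of a designed band-pass filter are all assumptions made for this example, and the DFT-based phase estimate mentioned above is not reproduced.

```python
import numpy as np


def enf_phase_and_frequency(x, fs, f0=50.0, half_bw=0.5):
    """Estimate unwrapped phase and instantaneous frequency of the ENF component.

    x       : 1-D real audio signal
    fs      : sampling rate in Hz
    f0      : assumed nominal mains frequency in Hz (hypothetical default)
    half_bw : assumed half-width of the pass band in Hz (hypothetical default)
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    freqs = np.fft.fftfreq(n, d=1.0 / fs)

    # Keep only positive-frequency bins inside [f0 - half_bw, f0 + half_bw].
    # Doubling them and inverting the FFT gives the analytic signal of the
    # band-passed input, i.e. band-pass filtering and the Hilbert transform
    # are done in a single step.
    mask = (freqs >= f0 - half_bw) & (freqs <= f0 + half_bw)
    analytic = np.fft.ifft(2.0 * spectrum * mask)

    # Instantaneous phase, unwrapped so that it grows continuously.
    phase = np.unwrap(np.angle(analytic))
    # Instantaneous frequency in Hz is the phase derivative over 2*pi.
    inst_freq = np.diff(phase) * fs / (2.0 * np.pi)
    return phase, inst_freq


# Synthetic check signal: a 50.1 Hz "mains hum" plus an unrelated 120 Hz tone.
fs = 1000
t = np.arange(10 * fs) / fs
signal = np.cos(2 * np.pi * 50.1 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)
phase, inst_freq = enf_phase_and_frequency(signal, fs)
```

On an untampered recording the estimated phase should evolve smoothly, whereas an insertion or deletion would be expected to show up as an abrupt phase or frequency change; the framing and fitting that turn these sequences into shallow features are not sketched here.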