In audio copy-move forgery forensics, existing traditional methods typically first segment audio into voiced and silent segments, then compute the similarity between voiced segments to detect and locate forged segments. However, audio recorded in noisy environments is difficult to segment reliably, and the manually set, heuristic similarity thresholds lack robustness. Existing deep learning methods extract features from audio and then use neural networks for binary classification, but they cannot locate the forged segments. Therefore, to locate audio copy-move forgery segments, we improve on deep learning methods and propose a robust localization model based on CNN-driven spectral analysis. In the localization model, a Feature Extraction Module extracts deep features from Mel-spectrograms, a Correlation Detection Module automatically determines the correlation between these deep features, and a Mask Decoding Module visually locates the forged segments. Experimental results show that, compared with existing methods, the localization model improves the detection accuracy of audio copy-move forgery by 3.0–6.8% and improves the average detection accuracy on forged audio subjected to post-processing attacks such as noise addition, filtering, resampling, and MP3 compression by over 7.0%.
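To make the idea behind the Correlation Detection and Mask Decoding stages concrete, the following is a minimal, hypothetical sketch (not the paper's actual model): it computes a frame-level self-similarity matrix over per-frame features, as a learned correlation module would, and derives a binary localization mask by flagging frames that closely match some *other* frame. All function names, shapes, and the threshold are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of correlation-based copy-move localization.
# In the paper, a CNN extracts the per-frame features from Mel-spectrograms
# and the correlation/masking is learned; here both are simulated.

def frame_self_similarity(features: np.ndarray) -> np.ndarray:
    """Cosine self-similarity between per-frame feature vectors.

    features: (T, D) matrix, one D-dimensional feature vector per time frame
    returns:  (T, T) similarity matrix with entries in [-1, 1]
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-8)  # avoid division by zero
    return unit @ unit.T

def candidate_copy_frames(sim: np.ndarray, thresh: float = 0.95) -> np.ndarray:
    """Binary frame mask: 1 where a frame is highly similar to another frame."""
    off_diag = sim - np.eye(sim.shape[0])  # suppress trivial self-matches
    return (off_diag.max(axis=1) > thresh).astype(int)

# Toy example: a 10-frame "recording" where frames 7-9 duplicate frames 1-3,
# simulating a copy-move forgery on stand-in 64-dim frame features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 64))
feats[7:10] = feats[1:4]  # the copied segment
mask = candidate_copy_frames(frame_self_similarity(feats))
print(mask)  # the source and pasted frames (1-3 and 7-9) are flagged
```

A fixed threshold like the one above is exactly the heuristic that the learned Correlation Detection Module is meant to replace, since hand-tuned thresholds break down under noise and other post-processing attacks.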