“…Audio artifact removal aims to remove noise from a sound signal while improving its intelligibility and quality. Deep learning-based audio diagnosis techniques usually introduce artificial residual noise, particularly when the training target ignores phase information [104], as is the case for the magnitude of the clean speech and its variations [111, 112] or the ideal ratio mask [113, 114]. This residual noise is typically highly non-stationary, and it retains sizable power in the mid-to-high frequency region, where the audio power spectral density (PSD) is low.…”
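To illustrate why a magnitude-only target can leave residual noise, the following is a minimal sketch and not the method of any cited work. It assumes one common definition of the ideal ratio mask, sqrt(|S|^2 / (|S|^2 + |D|^2)) per frequency bin, and uses a single synthetic frame (a sine tone plus Gaussian noise). All function names (`dft`, `ideal_ratio_mask`, `apply_mask`, `residual_energy`, `make_demo`) and parameter values are hypothetical choices made for this illustration.

```python
import cmath
import math
import random


def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def ideal_ratio_mask(clean_spec, noise_spec):
    """Per-bin mask, assumed here to be sqrt(|S|^2 / (|S|^2 + |D|^2))."""
    mask = []
    for s, d in zip(clean_spec, noise_spec):
        ps, pd = abs(s) ** 2, abs(d) ** 2
        mask.append(math.sqrt(ps / (ps + pd)) if ps + pd > 0 else 0.0)
    return mask


def apply_mask(noisy_spec, mask, phase_spec):
    """Scale the noisy magnitude by the mask and attach the phase of phase_spec."""
    return [
        m * abs(y) * cmath.exp(1j * cmath.phase(p))
        for m, y, p in zip(mask, noisy_spec, phase_spec)
    ]


def residual_energy(estimate, clean_spec):
    """Total squared error between an estimated and the clean spectrum."""
    return sum(abs(e - s) ** 2 for e, s in zip(estimate, clean_spec))


def make_demo(n=64, tone_bin=4, noise_std=0.3, seed=0):
    """Synthetic frame (hypothetical values): sine tone plus Gaussian noise."""
    rng = random.Random(seed)
    clean = [math.sin(2 * math.pi * tone_bin * t / n) for t in range(n)]
    noise = [rng.gauss(0.0, noise_std) for _ in range(n)]
    clean_spec, noise_spec = dft(clean), dft(noise)
    noisy_spec = [s + d for s, d in zip(clean_spec, noise_spec)]
    return clean_spec, noise_spec, noisy_spec


if __name__ == "__main__":
    clean_spec, noise_spec, noisy_spec = make_demo()
    mask = ideal_ratio_mask(clean_spec, noise_spec)
    # Usual practice with magnitude-only targets: reuse the noisy phase.
    with_noisy_phase = apply_mask(noisy_spec, mask, noisy_spec)
    # Oracle comparison: same masked magnitude, but with the clean phase.
    with_clean_phase = apply_mask(noisy_spec, mask, clean_spec)
    print("residual, noisy phase:", residual_energy(with_noisy_phase, clean_spec))
    print("residual, clean phase:", residual_energy(with_clean_phase, clean_spec))
```

Both reconstructions use the same masked magnitude in every bin, so any extra error in the noisy-phase version comes from the phase mismatch alone. By the reverse triangle inequality the noisy-phase error can never be smaller than the clean-phase error in any bin. A single stationary frame like this does not reproduce the non-stationary, mid-to-high frequency behaviour described in the excerpt. It only shows that an ideal magnitude mask still leaves a residual when phase is ignored.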