The stacked denoising autoencoder (SDAE) is well suited to denoising acoustic signals because it learns high-level features automatically, but its reconstruction quality is unstable under high-intensity noise. The reason is that noise emitted by neighboring equipment easily masks the acoustic signals of the target equipment, which reduces the smoothness of the signal and degrades the accuracy of fault diagnosis. Accordingly, this paper presents an SSDAE-MobileViT model that aims to identify fault location and fault degree accurately and efficiently in the presence of substantial background noise. First, a supervised stacked denoising autoencoder (SSDAE) is established to reduce the high-intensity noise in the fault acoustic signals; the Huber loss between the reconstructed signal and the theoretical signal guides the fine-tuning of the model. Subsequently, Mel-frequency cepstral coefficients (MFCCs) are used to extract the acoustic features of the reconstructed signal, which is then converted into a Mel-frequency spectrogram. Finally, the MobileViT model is used for fault classification, yielding an acoustic fault diagnosis model for rolling bearings under high-intensity noise. In comparative experiments on real signals, the proposed noise reduction method achieved the best signal-to-noise ratio increment, waveform similarity coefficient, and mean square deviation among the methods tested, outperforming three traditional noise reduction methods. Furthermore, the fault diagnosis model reached an average fault diagnosis accuracy of 99.2%, the highest among the fault diagnosis models compared.
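As a rough illustration of two quantities named above, the following minimal sketch computes a Huber loss between a reconstructed and a reference signal, and a signal-to-noise ratio increment. The function names, the threshold `delta`, and the exact SNR definition are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math


def huber_loss(reconstructed, target, delta=1.0):
    """Mean Huber loss: quadratic for residuals <= delta, linear beyond.

    `delta` is an assumed hyperparameter; the abstract gives no value.
    """
    if len(reconstructed) != len(target):
        raise ValueError("signals must have the same length")
    total = 0.0
    for r, t in zip(reconstructed, target):
        err = abs(r - t)
        if err <= delta:
            total += 0.5 * err * err
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(target)


def snr_db(clean, estimate):
    """SNR in dB of `estimate` measured against the reference `clean`.

    Assumes the estimate differs from the reference (nonzero residual power).
    """
    signal_power = sum(c * c for c in clean)
    noise_power = sum((c - e) ** 2 for c, e in zip(clean, estimate))
    return 10.0 * math.log10(signal_power / noise_power)


def snr_increment(clean, noisy, denoised):
    """Gain in SNR (dB) obtained by denoising, under the definition above."""
    return snr_db(clean, denoised) - snr_db(clean, noisy)
```

The Huber loss is less sensitive to large residuals than the squared error, which is one plausible reason to prefer it when the input contains strong impulsive noise.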