Acoustic sensing is a widely used approach to non-contact fault diagnosis of Gas-Insulated Switchgear (GIS). A GIS operates under a range of voltages, which causes large variation in the frequency content of its acoustic emissions. However, according to the frequency principle, neural networks tend to favour particular frequency components, which makes robust fault detection on GIS difficult. This paper introduces a novel multi-stage training method to improve the robustness of fault detection on GIS. The proposed method consists of three components: a Multi-Channel Based Frequency Regressor (MCBFR), an Audio Spectrogram Transformer Auto-Encoder (AST-AE), and a Feature Interaction Module (FIM). During the pre-training stage, MCBFR and AST-AE are optimised to extract their respective features from the acoustic signal. The FIM then fuses the features extracted by MCBFR and AST-AE during training of the model that produces the final diagnosis. In addition, the multi-stage training strategy reduces the cost of any later model retraining. The proposed method was validated on experimental data from a real GIS, where it showed competitive fault-detection performance compared with existing methods.