Target recognition relies mainly on three approaches: optical-image-based, echo-detection-based, and passive signal-analysis-based methods. Among them, the passive signal-based approach is closely tied to practical applications because of its strong environmental adaptability. Building on passive radar signal analysis, we design an end-to-end model that cascades a noise estimation network with a recognition network to identify working modes in noisy environments. The noise estimation network is based on U-Net; through feature extraction and reconstruction, it adaptively estimates a noise level map for each sample, which helps the recognition network reduce noise interference. The recognition network, tailored to the characteristics of radar signals, is based on a multi-scale convolutional attention network (MSCANet). First, deep group convolution isolates channel interaction in the shallow layers. Then, a multi-scale convolution module extracts finer-grained signal features without increasing model complexity. Finally, a self-attention mechanism suppresses the influence of low-correlation and negatively correlated channels and spatial positions. This design addresses the severe sensitivity of conventional methods to noise. We validated the proposed method in 81 noise environments, achieving an average accuracy of 94.65%. We also evaluated six machine learning algorithms and four deep learning algorithms for comparison; relative to these methods, the proposed MSCANet improved accuracy by approximately 17%, demonstrating better generalization and robustness.
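The data flow of the cascade described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several points are assumptions made only for illustration: "deep group convolution" is read as a depthwise convolution (one group per channel); the estimated noise map is passed to the recognition stage by channel concatenation; a moving-average residual stands in for the U-Net; fixed averaging kernels stand in for learned weights; and a simple sigmoid channel gate stands in for the self-attention mechanism. All function names are hypothetical.

```python
import numpy as np


def depthwise_conv1d(x, kernels):
    """Filter each channel with its own kernel (groups == channels), so no
    information is mixed across channels.
    x: (C, L); kernels: (C, k) with k odd and k <= L. Output: (C, L)."""
    return np.stack([np.convolve(x[c], kernels[c], mode="same")
                     for c in range(x.shape[0])])


def multi_scale_block(x, kernel_sizes=(3, 5, 7)):
    """Sum depthwise branches with different receptive fields.
    Moving-average kernels stand in for learned weights (assumption)."""
    channels = x.shape[0]
    branches = [depthwise_conv1d(x, np.full((channels, k), 1.0 / k))
                for k in kernel_sizes]
    return np.sum(branches, axis=0)


def channel_gate(x):
    """Scale each channel by a sigmoid of its mean activation.
    A simple stand-in for the attention stage, not the paper's mechanism."""
    weights = 1.0 / (1.0 + np.exp(-x.mean(axis=1, keepdims=True)))
    return x * weights, weights


def estimate_noise_map(x, window=9):
    """Placeholder for the U-Net: absolute residual from a moving average."""
    kernel = np.full((x.shape[0], window), 1.0 / window)
    return np.abs(x - depthwise_conv1d(x, kernel))


def cascade(x):
    """Noise estimation followed by recognition-side feature extraction."""
    noise_map = estimate_noise_map(x)
    # Assumption: the noise map is concatenated to the input as extra channels.
    stacked = np.concatenate([x, noise_map], axis=0)
    features = multi_scale_block(stacked)
    return channel_gate(features)


rng = np.random.default_rng(0)
signal = rng.normal(size=(2, 64))  # hypothetical 2-channel noisy sample
gated, weights = cascade(signal)
print(gated.shape, weights.shape)  # (4, 64) (4, 1)
```

Because every branch is depthwise, an impulse placed in one input channel never appears in another channel's output, which is the channel isolation property the shallow layers are described as providing. A classifier head would follow the gated features; it is omitted here because the abstract does not describe it.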