The rapid development of the Internet of Things has exacerbated issues such as spectrum resource scarcity, poor communication quality, and high communication energy consumption. Automatic modulation recognition (AMR), a key technology in cognitive radio, has emerged as a crucial solution to these challenges. Deep neural networks have recently been applied to AMR tasks and have achieved remarkable success. However, existing deep learning-based AMR methods often fail to fully consider the sensitivity of models to noise. This study proposes a masked autoencoder multi-scale attention feature fusion model (MAE-SigNet). The model integrates a masked autoencoder (MAE), a multi-scale feature extraction module, a bidirectional long short-term memory module, and a multi-scale attention module (MAM) to accomplish the AMR task under low signal-to-noise ratio (SNR) conditions. Additionally, we optimize the cross-entropy loss of MAE-SigNet by introducing the MAE decoder reconstruction error, which enhances the model's sensitivity to noise while achieving a more accurate feature representation. Experimental results demonstrate that MAE-SigNet achieves average recognition rates of 63.77%, 65.28%, and 75.26% on the RML2016.10a, RML2016.10b, and RML2016.04c datasets, respectively. Notably, MAE-SigNet exhibits outstanding performance at low SNR levels from −6 to 4 dB.
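As a sketch of the training objective described above, the loss augments the classification cross-entropy with the MAE decoder's reconstruction error; the abstract does not give the exact form, so the squared-error reconstruction term and the weighting coefficient λ below are assumptions for illustration:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda\,\mathcal{L}_{\text{rec}}, \qquad \mathcal{L}_{\text{rec}} = \frac{1}{|M|}\sum_{i \in M} \lVert x_i - \hat{x}_i \rVert_2^2,$$

where $\mathcal{L}_{\text{CE}}$ is the cross-entropy over modulation classes, $\mathcal{L}_{\text{rec}}$ penalizes the decoder's reconstruction $\hat{x}_i$ of the masked signal segments $M$, and λ balances classification accuracy against noise-robust feature learning.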