In recent years, advances in deep learning have led to excellent performance in synthetic aperture radar (SAR) automatic target recognition (ATR). However, speckle noise still makes SAR image classification challenging. To address this issue, this study proposes a multi-scale local–global feature fusion network (MFN) that integrates a convolutional neural network (CNN) and a transformer network. The proposed network comprises three branches: a ConvNeXt-SimAM branch, a Swin Transformer branch, and a multi-scale feature fusion branch. The ConvNeXt-SimAM branch extracts local texture detail features of the SAR images at different scales; incorporating the SimAM attention mechanism into the CNN block enhances the model's feature extraction capability in terms of both spatial and channel attention. The Swin Transformer branch extracts global semantic information from the SAR images at different scales. Finally, the multi-scale feature fusion branch fuses the local features with the global semantic information. Moreover, because empirically chosen hyperparameters can lead to poor accuracy and inefficiency, a Bayesian hyperparameter optimization algorithm was used to determine the optimal model hyperparameters; the resulting model is referred to as Bayes-MFN. On the MSTAR dataset, Bayes-MFN achieved average recognition accuracies of 99.26% and 94.27% for SAR vehicle targets under standard operating conditions (SOCs) and extended operating conditions (EOCs), respectively. Compared with the baseline model, the recognition accuracy improved by 12.74% and 25.26%, respectively. The results demonstrated that Bayes-MFN reduces the intra-class distance of the SAR images, resulting in more compact classification features and less interference from speckle noise.
Compared with other mainstream models, the Bayes-MFN model exhibited the best classification performance.
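
As a rough illustration of the SimAM attention mentioned above, the following minimal sketch applies the commonly published parameter-free SimAM weighting to a single feature-map channel in plain Python. It is not the implementation used in this study: the function name `simam_channel`, the list-of-lists input format, and the regularization value `lam=1e-4` are assumptions made for illustration only.

```python
import math


def simam_channel(fmap, lam=1e-4):
    """SimAM-style attention for one channel (hypothetical helper).

    fmap: 2D list of floats (H x W activations of a single channel).
    lam:  regularization constant (assumed value, not from the study).
    """
    values = [v for row in fmap for v in row]
    n = len(values) - 1  # number of "other" neurons in the channel
    if n < 1:
        raise ValueError("need at least two activations per channel")

    mean = sum(values) / len(values)
    # Squared deviation of each activation from the channel mean.
    sq_dev = [[(v - mean) ** 2 for v in row] for row in fmap]
    var = sum(d for row in sq_dev for d in row) / n

    out = []
    for row, dev_row in zip(fmap, sq_dev):
        out_row = []
        for v, d in zip(row, dev_row):
            # Inverse energy: larger for activations that stand out.
            inv_energy = d / (4.0 * (var + lam)) + 0.5
            weight = 1.0 / (1.0 + math.exp(-inv_energy))  # sigmoid
            out_row.append(v * weight)
        out.append(out_row)
    return out


if __name__ == "__main__":
    channel = [[1.0, 1.0], [1.0, 5.0]]
    for row in simam_channel(channel):
        print(row)
```

In this sketch, activations that deviate most from the channel mean receive the largest weights, and no learnable parameters are introduced. In a full network the same weighting would be applied independently to every channel of every feature map.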