Recently, deep learning technology was successfully applied to mechanical fault diagnosis. The convolutional neural network (CNN), as a prevalent deep learning model, occupies a place in intelligent fault diagnosis, which reduces the need for human feature extraction and prior knowledge, thereby achieving an end-to-end intelligent fault diagnosis model. However, the data for mechanical fault diagnosis in practical application are limited, the CNN model is too deep and too complex, making it prone to overfitting, and a model with too simple a structure and shallow layers cannot fully learn the effective features of the data. Convolutional filters with fixed window sizes are widely used in existing CNN models, which cannot flexibly select variable pivotal features. The model may be interfered with by redundant information in feature maps during training. Therefore, in this paper, a novel shallow multi-scale convolutional neural network with attention is proposed for bearing fault diagnosis. The shallow multi-scale convolutional neural network structure can fully learn the feature information of input data without overfitting. For the first time, a feature attention mechanism is developed for fault diagnosis to adaptively select features for classification more effectively, where the pivotal feature was emphasized, and the redundant feature was weakened through an attention mechanism. The time frequency representations as the input of the model were obtained from the vibration time domain signals, which contain the complete time domain and frequency domain information of the vibration signals. Compared with the current popular diagnostic methods, the results show that the proposed diagnostic method has fairly high accuracy, and its performance is superior to the existing methods. The average recognition accuracy was 99.86%, and the weak recognition rate of I-07 and I-14 labels was improved.extracted manually include kurtosis [3] and entropy [4], and the time-frequency domain features that can be extracted manually include the wavelet packet [5] and Hilbert spectrum [6]. Classification tasks are involved in fault location, mostly using k-nearest neighbor (k-NN) [3], support vector machine (SVM) [7,8], artificial neural network (ANN) [4], and other methods. Zhang et al. [9] optimized support vector machines using the inter-cluster distance (ICD) in the feature space (ICDSVM) to identify the fault types and the fault severity of bearing. The experimental results, taking into consideration various combinations of fault types, fault degrees, and loads, showed that the proposed method has high accuracy in identifying the fault type and fault degree of the bearing. Although these methods can make full use of existing human knowledge, they cannot sufficiently meet the requirements of working conditions and automation. The automatic completion of diagnostic tasks of feature extraction and classification can be overcome by new advanced artificial intelligence technology.With the rapid improvement of computational operations ...