The magnetic tile is the core component of a permanent magnet motor, and its defects seriously affect the quality of industrial motors. Automatic recognition of magnetic tile surface defects is difficult because the defect patterns are complex and diverse. Existing defect recognition methods are hard to apply in practice because their system structures are complicated and because the accuracy of their image segmentation and target detection is low across the diverse defect patterns. To improve on these approaches, this study proposes a self-supervised learning (SSL) method that benefits from nonlinear feature extraction. We propose an efficient multihead self-attention method that automatically locates single or multiple defect areas on a magnetic tile and extracts features of the defects. We also design a fully connected classifier that accurately classifies the different types of magnetic tile defects. In addition, we propose a knowledge distillation process without labels, which simplifies self-supervised training. The method proceeds as follows. First, a feature extraction model built on a standard vision transformer (ViT) backbone is trained by contrastive learning on an unlabeled dataset and is used to extract global and local features from the input magnetic tile images. Then, a fully connected neural network is trained on a labeled dataset to classify the known defect types. Finally, the feature extraction model and the defect classification model are combined into a relatively simple integrated system. The public magnetic tile surface defect dataset, which contains five defect categories and one nondefect category, is used for training, validation, and testing. We also use online data augmentation techniques to increase the number of training samples so that the model converges and achieves high classification accuracy. 
The experimental results show that the SSL method extracts richer and more detailed features than the supervised learning model. The composite model reaches a high testing accuracy of 98.3% and shows relatively strong robustness and good generalization ability.
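
To make the idea of knowledge distillation without labels more concrete, the following is a minimal sketch of one common formulation, in which a student network is trained to match the output distribution of a teacher network that is itself an exponential moving average (EMA) of the student. This summary does not specify the exact loss, so the function names, the temperatures (0.1 and 0.04), the centering term, and the momentum value are illustrative assumptions and not the settings of this study.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax; avoids taking log(0) on tiny probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - peak - log_total for x in scaled]


def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student distributions; no labels used.

    The teacher output is centered and sharpened with a low temperature
    (assumed values), and the student is pushed toward that distribution.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centered, teacher_temp)
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter slightly toward the student parameter."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


if __name__ == "__main__":
    # Two augmented views of the same image would supply these logits.
    teacher_out = [2.0, 0.0, -1.0]
    center = [0.0, 0.0, 0.0]
    agree = distillation_loss([2.0, 0.0, -1.0], teacher_out, center)
    disagree = distillation_loss([-1.0, 0.0, 2.0], teacher_out, center)
    print("loss when the student agrees with the teacher:", agree)
    print("loss when the student disagrees with the teacher:", disagree)
```

In such a scheme the loss is small when the student and teacher agree on different augmented views of the same image, so the supervisory signal comes from the images themselves and no defect labels are required during feature learning.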