Surface quality is a critical indicator of the overall quality of hot-rolled steel strip, so accurate surface inspection is essential. However, the complex texture of steel strip surface defects makes detection challenging. Here, a multi-scale feature fusion and attention based YOLOv5 (MFFA-YOLOv5) model is proposed. First, the bottom-layer features are up-sampled and fused not only with the middle-layer features but also with the top-layer features, so that the model better captures the surface texture information of the steel strip. Second, an improved attention module is introduced that extends the Convolutional Block Attention Module (CBAM) with down-sampling and up-sampling paths to handle both the global and the local information of the steel strip surface. In addition, a self-attention path is added to this module to improve its feature representation capability. Experimental results on the NEU-DET dataset show that the MFFA-YOLOv5 model significantly outperforms other state-of-the-art methods.
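As an illustration of the multi-scale fusion idea only, the following minimal sketch up-samples a low-resolution "bottom" feature map and concatenates it along the channel axis with both a "middle" and a "top" feature map. It is not the MFFA-YOLOv5 implementation. The helper names (`upsample_nearest`, `concat_channels`, `make_map`, `shape`), the nearest-neighbour interpolation, the 2x and 4x scale factors, the channel counts, and the reading of "bottom" as the lowest-resolution level are all assumptions made for the example. The convolutions that would normally follow each concatenation are omitted.

```python
# Toy feature maps are nested lists indexed as [channel][row][column].

def upsample_nearest(feature, factor):
    """Nearest-neighbour up-sampling: repeat every value `factor` times per axis."""
    out = []
    for channel in feature:
        new_channel = []
        for row in channel:
            wide_row = [v for v in row for _ in range(factor)]
            for _ in range(factor):
                new_channel.append(list(wide_row))
        out.append(new_channel)
    return out

def concat_channels(*features):
    """Concatenate feature maps along the channel axis; spatial sizes must match."""
    h, w = len(features[0][0]), len(features[0][0][0])
    fused = []
    for feature in features:
        for channel in feature:
            if len(channel) != h or any(len(row) != w for row in channel):
                raise ValueError("spatial sizes must match before concatenation")
            fused.append(channel)
    return fused

def make_map(channels, height, width, value):
    """Build a constant-valued feature map (hypothetical helper for the demo)."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]

def shape(feature):
    """Return (channels, height, width) of a nested-list feature map."""
    return (len(feature), len(feature[0]), len(feature[0][0]))

# Assumed pyramid: bottom is the coarsest level, top the finest.
bottom = make_map(4, 2, 2, 1.0)
middle = make_map(2, 4, 4, 2.0)
top = make_map(1, 8, 8, 3.0)

# Bottom features are fused with the middle level (2x up-sampling) ...
fused_middle = concat_channels(upsample_nearest(bottom, 2), middle)
# ... and also directly with the top level (4x up-sampling).
fused_top = concat_channels(upsample_nearest(bottom, 4), top)

print(shape(fused_middle), shape(fused_top))
```

In a real detector the same pattern would typically be expressed with tensor operations (an up-sampling layer followed by channel-wise concatenation and a convolution). The nested-list version is used here only to keep the example dependency-free.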