The symmetry between production efficiency and safety is a crucial aspect of industrial operations. To improve the detection of whether workers at height are using their safety harnesses correctly, this study introduces a machine vision approach as a substitute for manual supervision. Focusing on the safety rope that connects the worker to an anchor point, we propose a semantic segmentation mask annotation principle for evaluating proper harness use. We then introduce CEMFormer, a novel semantic segmentation model that uses ConvNeXt as its backbone, which surpasses the traditional ResNet in accuracy. Efficient Multi-Scale Attention (EMA) is incorporated to optimize channel weights and integrate spatial information. Mask2Former serves as the segmentation head, with Poly Loss as the classification loss and Log-Cosh Dice Loss as the mask loss, which improves training efficiency. Experimental results indicate that CEMFormer achieves a mean accuracy of 92.31%, surpassing the baseline and five state-of-the-art models. Ablation studies confirm that each component contributes to the model’s accuracy, supporting the effectiveness of the proposed approach for safeguarding workers at height.
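
The two loss terms named above can be illustrated with a short plain-Python sketch. It assumes the commonly used Poly-1 form (cross-entropy plus epsilon times one minus the target-class probability) and the log-cosh of a soft Dice loss; the abstract does not state which variants or hyperparameters CEMFormer uses, so the function names, the `epsilon` and `smooth` defaults, and the flat list inputs are hypothetical and are not taken from the authors' implementation.

```python
import math


def poly1_cross_entropy(logits, target, epsilon=1.0):
    """Poly-1 loss for one sample: CE + epsilon * (1 - p_t).

    logits:  list of raw class scores (assumed input format)
    target:  index of the ground-truth class
    epsilon: weight of the first polynomial term (assumed default)
    """
    # Numerically stable softmax probability of the target class.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    p_t = exps[target] / sum(exps)
    cross_entropy = -math.log(p_t)
    return cross_entropy + epsilon * (1.0 - p_t)


def log_cosh_dice(pred, truth, smooth=1.0):
    """Log-cosh Dice loss for one flattened mask: log(cosh(1 - Dice)).

    pred:   list of predicted foreground probabilities in [0, 1]
    truth:  list of ground-truth labels (0 or 1), same length as pred
    smooth: smoothing constant that avoids division by zero (assumed default)
    """
    intersection = sum(p * t for p, t in zip(pred, truth))
    denominator = sum(pred) + sum(truth)
    dice = (2.0 * intersection + smooth) / (denominator + smooth)
    return math.log(math.cosh(1.0 - dice))


if __name__ == "__main__":
    # Hypothetical toy inputs: three class logits and a four-pixel mask.
    print(poly1_cross_entropy([2.0, 0.5, -1.0], target=0))
    print(log_cosh_dice([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0]))
```

In this sketch a prediction that matches a binary mask exactly yields a Dice coefficient of 1 and therefore a log-cosh Dice loss of 0, while the Poly-1 term adds an extra penalty that grows as the target-class probability falls. In a Mask2Former-style head these per-sample values would be averaged over matched queries and combined with weighting factors, which are not given here.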