Automatic crowd counting has made significant progress in recent years. However, convolutional neural networks (CNNs) with fixed-size kernels struggle to handle the large scale variations present in crowd scenes, which severely limits counting performance. To alleviate this issue, we propose a semantic-enhancement Transformer crowd counting network (named SET) that improves the encoding of semantic relationships in crowd scenes. SET integrates global attention from the Transformer, learnable local attention, and the inductive bias of CNNs into a single counting model. Firstly, we introduce an efficient Transformer encoder to extract low-level global features of crowd scenes. Secondly, we propose a learnable ViTBlock that dynamically assigns appropriate weights to different regions, strengthening the model's global visual understanding. Finally, to guide the model to focus on crowd regions, we jointly employ a segmentation attention module and a feature aggregation module to fuse semantic and spatial features at multiple levels, yielding finer-grained features. We conduct extensive experiments on four challenging datasets (ShanghaiTech Part A/B, UCF-QNRF, and JHU-CROWD++) and achieve excellent results.
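The core idea of combining global self-attention with learnable per-region weighting can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the weight shapes, the sigmoid gate standing in for the learnable local attention, and all variable names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(x, wq, wk, wv):
    # standard scaled dot-product self-attention over all patch tokens:
    # every token attends to every other token (global context)
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = (q @ k.T) / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def region_gate(x, wg):
    # learnable per-token gate in (0, 1): a stand-in for region-wise
    # weighting that could emphasize crowd regions over background
    return 1.0 / (1.0 + np.exp(-(x @ wg)))

rng = np.random.default_rng(0)
n, d = 16, 8                      # 16 patch tokens, 8-dim features
x = rng.standard_normal((n, d))   # toy patch embeddings
wq, wk, wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
wg = 0.1 * rng.standard_normal((d, 1))

attended = global_attention(x, wq, wk, wv)
# modulate global context by the learned per-region weights
gated = region_gate(x, wg) * attended
print(gated.shape)  # (16, 8)
```

In a trained model the projection matrices and gate weights would be learned end-to-end; here random weights merely demonstrate how the two attention signals compose.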