Crowd counting is an important research topic in computer vision; its goal is to estimate the number of people in an image. In recent years, researchers have dramatically improved counting accuracy by regressing density maps. However, because of the inherent domain shift, a model trained on an expensively hand-labelled dataset (the source domain) performs poorly on a dataset with scarce labels (the target domain). To address this issue, a novel dynamic scale aggregation network (DSANet) is proposed to reduce both the style gap and the cross-domain variation in head scales. Specifically, a practical style transfer layer is introduced to reduce the appearance discrepancy between the source and target domains. The translated source samples and the target samples are then encoded by a generator, consisting of a VGG16 network and dynamic scale aggregation modules (DSA modules), which produces the corresponding density maps. The DSA module adaptively adjusts its parameters according to the input features and effectively fuses multi-scale information to overcome cross-domain head scale variations. Next, a discriminator judges whether an input density map comes from the source or the target domain. Finally, the domain distributions are aligned through adversarial training between the generator and the discriminator. Experiments show that our network outperforms current state-of-the-art methods, improving performance on the target domain while maintaining performance on the source domain without significant degradation.
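
The following is a minimal PyTorch sketch of the adversarial alignment step described above. All names here (Discriminator, adversarial_step, lambda_adv) are illustrative assumptions rather than the paper's released code, and `generator` stands in for any density-map regressor such as the VGG16-plus-DSA-module architecture; the loss choices (MSE counting loss, binary cross-entropy domain loss) are common defaults, not confirmed details of DSANet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Patch-level classifier: does a density map come from source or target?"""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, 2, 1),  # raw logits, one per spatial patch
        )

    def forward(self, density_map):
        return self.net(density_map)

def adversarial_step(generator, discriminator, g_opt, d_opt,
                     src_img, src_gt, tgt_img, lambda_adv=0.01):
    """One training step: counting loss on (translated) source, adversarial loss on target."""
    # -- generator update: regress source density maps and try to fool the discriminator
    src_pred = generator(src_img)   # style-translated source images -> density maps
    tgt_pred = generator(tgt_img)   # unlabelled target images -> density maps
    count_loss = F.mse_loss(src_pred, src_gt)
    tgt_logits = discriminator(tgt_pred)
    # drive target maps toward the "source" label (1) to align the two distributions
    adv_loss = F.binary_cross_entropy_with_logits(
        tgt_logits, torch.ones_like(tgt_logits))
    g_opt.zero_grad()
    (count_loss + lambda_adv * adv_loss).backward()
    g_opt.step()

    # -- discriminator update: separate source (label 1) from target (label 0) maps
    src_logits = discriminator(src_pred.detach())
    tgt_logits = discriminator(tgt_pred.detach())
    d_loss = (F.binary_cross_entropy_with_logits(src_logits, torch.ones_like(src_logits))
              + F.binary_cross_entropy_with_logits(tgt_logits, torch.zeros_like(tgt_logits)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()
    return count_loss.item(), d_loss.item()
```

Under these assumptions, the generator sees gradients from both the labelled source counting objective and the discriminator's judgment on unlabelled target maps, which is what aligns the source and target density-map distributions while preserving counting accuracy on the source domain.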