Natural breeding scenes have the characteristics of a large number of cows, complex lighting, and a complex background environment, which presents great difficulties for the detection of dairy cow estrus behavior. However, the existing research on cow estrus behavior detection works well in ideal environments with a small number of cows and has a low inference speed and accuracy in natural scenes. To improve the inference speed and accuracy of cow estrus behavior in natural scenes, this paper proposes a cow estrus behavior detection method based on the improved YOLOv5. By improving the YOLOv5 model, it has stronger detection ability for complex environments and multi-scale objects. First, the atrous spatial pyramid pooling (ASPP) module is employed to optimize the YOLOv5l network at multiple scales, which improves the model’s receptive field and ability to perceive global contextual multiscale information. Second, a cow estrus behavior detection model is constructed by combining the channel-attention mechanism and a deep-asymmetric-bottleneck module. Last, K-means clustering is performed to obtain new anchors and complete intersection over union (CIoU) is used to introduce the relative ratio between the predicted box of the cow mounting and the true box of the cow mounting to the regression box prediction function to improve the scale invariance of the model. Multiple cameras were installed in a natural breeding scene containing 200 cows to capture videos of cows mounting. A total of 2668 images were obtained from 115 videos of cow mounting events from the training set, and 675 images were obtained from 29 videos of cow mounting events from the test set. The training set is augmented by the mosaic method to increase the diversity of the dataset. The experimental results show that the average accuracy of the improved model was 94.3%, that the precision was 97.0%, and that the recall was 89.5%, which were higher than those of mainstream models such as YOLOv5, YOLOv3, and Faster R-CNN. The results of the ablation experiments show that ASPP, new anchors, C3SAB, and C3DAB designed in this study can improve the accuracy of the model by 5.9%. Furthermore, when the ASPP dilated convolution was set to (1,5,9,13) and the loss function was set to CIoU, the model had the highest accuracy. The class activation map function was utilized to visualize the model’s feature extraction results and to explain the model’s region of interest for cow images in natural scenes, which demonstrates the effectiveness of the model. Therefore, the model proposed in this study can improve the accuracy of the model for detecting cow estrus events. Additionally, the model’s inference speed was 71 frames per second (fps), which meets the requirements of fast and accurate detection of cow estrus events in natural scenes and all-weather conditions.