The number of wheat ears in a field is an important parameter for accurately estimating wheat yield. In a large field, however, it is hard to conduct an automated and accurate counting of wheat ears because of their density and mutual overlay. Unlike the majority of the studies conducted on deep learning-based methods that usually count wheat ears via a collection of static images, this paper proposes a counting method based directly on a UAV video multi-objective tracking method and better counting efficiency results. Firstly, we optimized the YOLOv7 model because the basis of the multi-target tracking algorithm is target detection. Simultaneously, the omni-dimensional dynamic convolution (ODConv) design was applied to the network structure to significantly improve the feature-extraction capability of the model, strengthen the interaction between dimensions, and improve the performance of the detection model. Furthermore, the global context network (GCNet) and coordinate attention (CA) mechanisms were adopted in the backbone network to implement the effective utilization of wheat features. Secondly, this study improved the DeepSort multi-objective tracking algorithm by replacing the DeepSort feature extractor with a modified ResNet network structure to achieve a better extraction of wheat-ear-feature information, and the constructed dataset was then trained for the re-identification of wheat ears. Finally, the improved DeepSort algorithm was used to calculate the number of different IDs that appear in the video, and an improved method based on YOLOv7 and DeepSort algorithms was then created to calculate the number of wheat ears in large fields. The results show that the mean average precision (mAP) of the improved YOLOv7 detection model is 2.5% higher than that of the original YOLOv7 model, reaching 96.2%. The multiple-object tracking accuracy (MOTA) of the improved YOLOv7–DeepSort model reached 75.4%. By verifying the number of wheat ears captured by the UAV method, it can be determined that the average value of an L1 loss is 4.2 and the accuracy rate is between 95 and 98%; thus, detection and tracking methods can be effectively performed, and the efficient counting of wheat ears can be achieved according to the ID value in the video.