Multiple Object Tracking (MOT) focuses on tracking all the objects in a video. Most MOT solutions follow a tracking-by-detection or a joint detection-and-tracking paradigm, generating object trajectories by exploiting the correlations between detected objects in consecutive frames. However, according to our observations, considering only the correlations between the objects in the current frame and those in the previous frame leads to an exponential decay of information over time, which results in misidentification of objects, especially in scenes with dense crowds and occlusions. To address this problem, we propose an effective finite-tailed updating (FTU) strategy that generates the appearance template of an object in the current frame by exploiting its local temporal context in the video. Specifically, we model the appearance template of the object in the current frame on its appearance templates in multiple earlier frames and dynamically combine them to obtain a more effective representation. Extensive experiments show that our tracker outperforms state-of-the-art methods on the MOT Challenge benchmark. We achieve 73.7% and 73.0% IDF1, and 46.1% and 45.0% MT, on the MOT16 and MOT17 datasets, which are 0.9% and 0.7% higher in IDF1 and 1.4% and 1.8% higher in MT than FairMOT, respectively.
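The contrast between a frame-to-frame update and a finite window of earlier templates can be illustrated with a small sketch. This is not the paper's FTU formulation, which the abstract does not specify. It assumes a fixed-size window of past appearance features combined with softmax weights over cosine similarity to the newest feature. The names `ema_update`, `ema_weight`, `FiniteWindowTemplate`, `window`, and `temperature` are all hypothetical.

```python
import math
from collections import deque


def ema_update(template, feature, momentum=0.9):
    """Frame-to-frame update: blend the previous template with the new feature."""
    return [momentum * t + (1.0 - momentum) * f for t, f in zip(template, feature)]


def ema_weight(k, momentum=0.9):
    """Weight that the feature from k frames ago keeps under ema_update.

    It shrinks geometrically with k, which is the exponential decay
    the abstract refers to.
    """
    return (1.0 - momentum) * momentum ** k


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class FiniteWindowTemplate:
    """Hypothetical finite-window template: keeps only the last `window` features."""

    def __init__(self, window=5, temperature=0.1):
        self.history = deque(maxlen=window)  # older features fall out of the window
        self.temperature = temperature

    def update(self, feature):
        """Add the new feature and return the combined template."""
        self.history.append(list(feature))
        # Assumed weighting: similarity of each stored feature to the newest one.
        scores = [cosine(h, feature) / self.temperature for h in self.history]
        weights = softmax(scores)
        dim = len(feature)
        return [sum(w * h[i] for w, h in zip(weights, self.history)) for i in range(dim)]
```

In this sketch every frame inside the window contributes with a weight chosen at update time, and frames outside the window contribute nothing, whereas under `ema_update` the contribution of an old frame never reaches zero but shrinks geometrically.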
State-of-the-art purely unsupervised person re-ID methods first group all the images into clusters and assign each clustered image a pseudo label based on the clustering result. They then construct a memory dictionary that stores all the clustered images and train the feature extraction network on this dictionary. All of these methods discard the unclustered outliers produced by the clustering process and train the network only on the clustered images. The unclustered outliers are complicated images: they show different clothes and poses, low resolution, severe occlusion, and so on, all of which are common in real-world applications. Models trained only on clustered images are therefore less robust and cannot handle such complicated images. We construct a memory dictionary that covers both clustered and unclustered images, and design a corresponding contrastive loss that takes both kinds of images into account. The experimental results show that this memory dictionary and contrastive loss improve person re-ID performance, which demonstrates the effectiveness of including unclustered complicated images in unsupervised person re-ID.
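A minimal sketch of a memory dictionary that keeps unclustered images can make the idea concrete. The abstract does not give the dictionary layout or the loss, so everything here is an assumption: clusters are stored as normalized mean features, each outlier is stored as its own entry, outliers carry the label `-1` (an assumed labelling convention), and the loss is a generic InfoNCE-style contrastive loss. The names `build_memory`, `contrastive_loss`, and `temperature` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def build_memory(features, labels):
    """Build memory entries from clustered and unclustered samples.

    labels[i] >= 0 is a cluster id; labels[i] == -1 marks an unclustered outlier.
    Returns (memory, slot_of), where slot_of[i] is the memory index of sample i.
    """
    sums, counts = {}, {}
    for f, lab in zip(features, labels):
        if lab >= 0:
            acc = sums.setdefault(lab, [0.0] * len(f))
            for i, x in enumerate(f):
                acc[i] += x
            counts[lab] = counts.get(lab, 0) + 1

    memory, cluster_slot = [], {}
    # One entry per cluster: the normalized mean feature of its members.
    for lab in sorted(sums):
        cluster_slot[lab] = len(memory)
        memory.append(l2_normalize([x / counts[lab] for x in sums[lab]]))

    slot_of = []
    for f, lab in zip(features, labels):
        if lab >= 0:
            slot_of.append(cluster_slot[lab])
        else:
            # Outliers are kept, not discarded: each gets its own entry.
            slot_of.append(len(memory))
            memory.append(l2_normalize(f))
    return memory, slot_of


def contrastive_loss(query, memory, positive_slot, temperature=0.05):
    """InfoNCE-style loss: negative log-softmax of the positive entry's similarity."""
    q = l2_normalize(query)
    logits = [sum(a * b for a, b in zip(q, m)) / temperature for m in memory]
    mx = max(logits)
    log_denom = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_denom - logits[positive_slot]
```

A training loop under these assumptions would look up `slot_of[i]` for each sampled image and use it as `positive_slot`, so clustered and unclustered images are handled by the same loss.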