Part-MOT, a one-stage anchor-free architecture that unifies object identification representation and detection in a single task for visual object tracking, is presented. For object representation, a position-relevant feature is obtained from the center-ness information, exploiting the anchor-free paradigm to encode the feature map as an instance-aware embedding. To adapt to object movement, a clustering-based method for obtaining a global instance feature is introduced, making the approach more robust and enabling better tracking decisions. Part-MOT achieves state-of-the-art performance on public datasets, with especially strong results under object deformation and movement changes.
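The two mechanisms named in the abstract, center-ness-weighted embedding pooling and clustering of per-frame embeddings into a global instance feature, can be sketched as follows. This is a minimal illustration under stated assumptions, not Part-MOT's actual implementation: the function names, the box-cropped pooling, and the use of k-means are all assumptions.

```python
import torch
import numpy as np
from sklearn.cluster import KMeans

def centerness_weighted_embedding(feat_map, centerness, box):
    """Pool a position-relevant, instance-aware embedding by weighting
    feature-map locations inside a box with their center-ness scores.
    (Hypothetical sketch; the paper's exact pooling is not specified here.)

    feat_map:   (C, H, W) backbone feature map
    centerness: (H, W) anchor-free center-ness map in [0, 1]
    box:        (x1, y1, x2, y2) integer coordinates on the feature grid
    """
    x1, y1, x2, y2 = box
    region = feat_map[:, y1:y2, x1:x2]                    # (C, h, w)
    w = centerness[y1:y2, x1:x2]
    w = w / (w.sum() + 1e-6)                              # normalize to a distribution
    return (region * w.unsqueeze(0)).flatten(1).sum(-1)   # (C,)

def global_instance_feature(track_embeddings, k=3):
    """Cluster a track's per-frame embeddings and return the cluster
    centers as a movement-robust global instance feature.
    (K-means is a stand-in; the paper's clustering method may differ.)
    """
    X = torch.stack(track_embeddings).numpy()             # (T, C)
    k = min(k, len(X))
    centers = KMeans(n_clusters=k, n_init=10).fit(X).cluster_centers_
    return torch.from_numpy(centers.astype(np.float32))   # (k, C)
```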
The key challenge in RGB-T tracking is how to fuse dual-modality information to build a robust RGB-T tracker. Motivated by the CNN structure for local features and the vision transformer structure for global representations, the authors propose a two-stream hybrid structure, termed CMC²R, that combines convolutional operations and self-attention mechanisms to learn an enhanced representation. CMC²R fuses local features and global representations at different resolutions through the transformer layer of the encoder block, and the two modalities collaborate to obtain contextual information via spatial and channel self-attention. Temporal association is performed with track queries: each track query models the entire track of an object and is updated frame by frame to build long-range temporal relations. Experimental results show the effectiveness of the proposed method, which achieves state-of-the-art performance.
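A minimal PyTorch sketch of frame-by-frame track-query propagation follows. The module name, the decoder-plus-GRU update, and all shapes are assumptions made for illustration; the abstract describes CMC²R's query update only at a high level.

```python
import torch
import torch.nn as nn

class TrackQueryUpdater(nn.Module):
    """Sketch: one query per live track attends to the fused RGB-T tokens
    of the current frame, then the query state is updated so it carries
    the track's temporal context forward."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                                  batch_first=True)
        self.update = nn.GRUCell(dim, dim)  # recurrent fusion is an assumption

    def forward(self, track_queries, frame_tokens):
        # track_queries: (N, dim) one embedding per tracked object
        # frame_tokens:  (HW, dim) fused RGB-T encoder output for this frame
        out = self.decoder(track_queries.unsqueeze(0),
                           frame_tokens.unsqueeze(0)).squeeze(0)
        return self.update(out, track_queries)  # (N, dim) updated queries
```

In use, the updated queries are fed back in as the next frame's `track_queries`, which is how a long-range temporal relation accumulates over the track.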
The attention mechanism has produced impressive results in object tracking, but for a good trade-off between performance and efficiency, CNN-based approaches still dominate owing to the quadratic complexity of attention. Here, the SGF module is proposed, an efficient feature fusion block for effective object tracking that reduces the attention complexity to linear. The proposed method fuses features with attention in a coarse-to-fine manner. In the low-resolution semantic branch, the top-K regions with the highest attention scores are selected; in the high-resolution detail branch, attention is computed only within the regions corresponding to those top-K regions. Thus, the features from the high-resolution branch can be fused efficiently under the guidance of the low-resolution branch, as the sketch below illustrates. Experiments on RGB and RGB-T datasets with reformed FairMOT and MDNet+RGBT trackers demonstrate the effectiveness of the proposed method.
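The coarse-to-fine scheme can be sketched as follows: the low-resolution branch scores regions, the top-K are kept, and high-resolution attention is restricted to the tokens of those regions, so cost grows with K rather than quadratically with the token count. The shapes, the block-ordered token layout, and the per-region loop are assumptions made for clarity, not the SGF module's actual implementation.

```python
import torch
import torch.nn.functional as F

def topk_guided_attention(q_lo, k_lo, q_hi, k_hi, v_hi, topk=4, ratio=4):
    """Coarse-to-fine attention in the spirit of the SGF abstract (sketch).

    q_lo, k_lo:        (N_lo, C) low-resolution semantic tokens
    q_hi, k_hi, v_hi:  (N_hi, C) high-resolution detail tokens; tokens
        i*block .. (i+1)*block-1 are assumed to lie under low-res token i,
        with block = ratio * ratio.
    """
    C = q_lo.shape[-1]
    block = ratio * ratio

    # Coarse stage: every low-res query keeps its top-K key regions.
    scores = q_lo @ k_lo.t() / C ** 0.5             # (N_lo, N_lo)
    topk_idx = scores.topk(topk, dim=-1).indices    # (N_lo, K)

    # Expand each selected region to its high-res token indices.
    offsets = torch.arange(block)
    hi_idx = (topk_idx.unsqueeze(-1) * block + offsets).flatten(1)  # (N_lo, K*block)

    # Fine stage: high-res queries attend only within their region's top-K
    # keys, so cost is O(N_hi * K * block) instead of O(N_hi ** 2).
    out = torch.empty_like(q_hi)
    for i in range(q_lo.shape[0]):                  # batched gather in practice
        q = q_hi[i * block:(i + 1) * block]         # (block, C)
        k, v = k_hi[hi_idx[i]], v_hi[hi_idx[i]]     # (K*block, C)
        attn = F.softmax(q @ k.t() / C ** 0.5, dim=-1)
        out[i * block:(i + 1) * block] = attn @ v
    return out
```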