At present, the target of interest in visual tracking is specified in the form of a bounding box. Because target shapes vary arbitrarily, the bounding box may contain a large amount of non-target information, and tracker performance degrades severely in complex tracking scenarios. To address this problem, in this letter, the authors propose a novel tracking framework that jointly exploits the visual template and natural language (VNTrack) to alleviate the impact of bounding box ambiguity. Specifically, the authors first use a pre-trained language model to extract features from the language description of the target. Then, a feature alignment module is designed to align and enhance the visual template feature and the natural language feature. In addition, the authors design a multimodal query module to fuse the visual template, natural language, and search region information. Experimental results on tracking benchmarks with language annotations show that the proposed VNTrack is competitive with state-of-the-art trackers.
Data availability statement: The data that support the findings of this study are available from the corresponding author upon reasonable request.
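A minimal sketch of how the described pipeline could be wired together: a pre-trained language model encodes the target description, the feature alignment module aligns it with the visual template feature (modeled here with cross-attention, an assumption), and the multimodal query module fuses template, language, and search-region tokens. Module names, shapes, and the transformer-based fusion are illustrative assumptions, not the letter's stated implementation.

```python
import torch
import torch.nn as nn

class FeatureAlignment(nn.Module):
    """Align/enhance visual template tokens with language tokens via cross-attention (assumed design)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, template, language):
        # template: (B, Nt, C) visual template tokens; language: (B, Nl, C) text tokens
        enhanced, _ = self.attn(query=template, key=language, value=language)
        return self.norm(template + enhanced)

class MultimodalQuery(nn.Module):
    """Fuse template, language, and search-region tokens into a single sequence (assumed design)."""
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, template, language, search):
        fused = self.encoder(torch.cat([template, language, search], dim=1))
        # Keep only the search-region tokens; a box head would predict the
        # target state from these fused features.
        return fused[:, -search.shape[1]:, :]
```

In such a layout, the language tokens would come from a frozen pre-trained language model (e.g., a BERT-style encoder projected to the shared dimension), and the fused search tokens would feed the usual tracking head.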
With the emergence of pre-trained vision-language models like CLIP, how to adapt them to various downstream classification tasks has garnered significant attention in recent research. Adaptation strategies can typically be categorized into three paradigms: zero-shot adaptation, few-shot adaptation, and the recently proposed training-free few-shot adaptation. Most existing approaches are tailored to a specific setting and can only cater to one or two of these paradigms. In this paper, we introduce a versatile adaptation approach that works effectively under all three settings. Specifically, we propose dual memory networks that comprise dynamic and static memory components. The static memory caches training data knowledge, enabling training-free few-shot adaptation, while the dynamic memory preserves historical test features online during the testing process, allowing for the exploration of additional data insights beyond the training set. This novel capability enhances model performance in the few-shot setting and enables model usability in the absence of training data. The two memory networks employ the same flexible memory interactive strategy, which can operate in a training-free mode and can be further enhanced by incorporating learnable projection layers. Our approach is tested across 11 datasets under the three task settings. Remarkably, in the zero-shot scenario, it outperforms existing methods by over 3% and even shows superior results against methods utilizing external training data. Additionally, our method exhibits robust performance against natural distribution shifts. Codes are available at https://github.com/YBZh/DMN.
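A minimal sketch of a cache-style memory readout in the spirit of the described dual memory networks: the static memory stores labeled training features, the dynamic memory accumulates test features and their pseudo-labels online, and both are read with the same similarity-weighted interaction before fusion with the zero-shot text logits. The exponential similarity kernel, the fusion weight `alpha`, and the function names are illustrative assumptions; the authors' exact formulation is in the linked repository.

```python
import torch
import torch.nn.functional as F

def memory_readout(query, keys, values, beta=5.0):
    """Similarity-weighted readout of a (key, value) memory (assumed interaction).

    query:  (d,)   L2-normalized test image feature
    keys:   (n, d) L2-normalized cached features (training shots or past test features)
    values: (n, c) class distributions stored with each key (one-hot for the static
            memory, soft pseudo-labels for the dynamic memory in this sketch)
    """
    sim = keys @ query                        # (n,) cosine similarities
    weights = torch.exp(-beta * (1.0 - sim))  # sharpen so near-duplicates dominate
    return weights @ values                   # (c,) memory-based class logits

def classify(query, text_features, static_mem, dynamic_mem, alpha=0.5):
    """Fuse zero-shot text logits with static and dynamic memory readouts (assumed fusion)."""
    logits = text_features @ query            # (c,) CLIP text-image similarity
    for keys, values in (static_mem, dynamic_mem):
        if keys is not None and len(keys) > 0:
            logits = logits + alpha * memory_readout(query, keys, values)
    return F.softmax(logits, dim=-1)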