2020 IEEE Winter Conference on Applications of Computer Vision (WACV) 2020
DOI: 10.1109/wacv45572.2020.9093425

Real-time Visual Object Tracking with Natural Language Description

Abstract: Tracking with natural-language (NL) specification is a powerful new paradigm to yield trackers that initialize without a manually specified bounding box, stay on target in spite of occlusions, and auto-recover when diverged. These advantages stem in part from visual appearance and NL having distinct and complementary invariance properties. However, realizing these advantages is technically challenging: the two modalities have incompatible representations. In this paper, we present the first practical and compet…

Cited by 46 publications
(41 citation statements)
References 44 publications
(71 reference statements)
“…Secondly, as depicted in Figure 1, the tracker's ability to accurately locate the target in long-term scenes has been substantially improved by taking advantage of jointly optimized the two heterogeneous features of vision and language. However, the previously published methods [15,16,29] did not achieve competitive results on the popular tracking benchmarks. We found that the current method has the following two problems: (1) The learning strategy of latent alignment between the text query and the video frame is overly simplistic, and (2) the decoded representation of 2D convolution-based models has some limitations to capture the relationship between different entities and the parts of the entity itself.…”
Section: Introduction (mentioning)
confidence: 84%
“…Recently, tracking by natural language specification (TNL) that does not require a manually-specified bounding box for initialization has been studied [15,16,29]. Combining natural language understanding and visual object tracking has the following benefits.…”
Section: Introduction (mentioning)
confidence: 99%