2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.777

Tracking by Natural Language Specification

Cited by 127 publications (182 citation statements)
References 30 publications (78 reference statements)

“…Moreover, to transfer textual information to the visual domain, we rely on dynamic convolutional filters as earlier used in [27,15]. Unlike static convolutional filters that are used in conventional neural networks, dynamic filters are generated depending on the input, in our case on the encoded sentence representation.…”
Section: Language Encoding As Dynamic Filters
confidence: 99%
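
The excerpt above describes dynamic convolutional filters: instead of learning fixed weights, the filter is predicted from the encoded query sentence, so each language input yields its own filter that is convolved with the visual features. The sketch below illustrates this idea in PyTorch; the layer names, tensor sizes, and the use of a 1x1 filter are illustrative assumptions, not the cited papers' exact architecture.

```python
import torch
import torch.nn.functional as F

embed_dim = 256      # size of the sentence embedding (assumed)
feat_channels = 64   # channels of the visual feature map (assumed)
kernel_size = 1      # 1x1 dynamic filter, kept minimal for clarity

# Small head that maps the sentence embedding to filter weights.
filter_gen = torch.nn.Linear(embed_dim, feat_channels * kernel_size * kernel_size)

sentence_emb = torch.randn(1, embed_dim)              # encoded query sentence
visual_feat = torch.randn(1, feat_channels, 32, 32)   # CNN feature map of a frame

# Generate the sentence-dependent filter and reshape it for conv2d:
# (out_channels=1, in_channels=feat_channels, kH, kW).
dyn_filter = filter_gen(sentence_emb).view(1, feat_channels, kernel_size, kernel_size)

# Convolving the visual features with the dynamic filter yields a
# response map that highlights regions matching the language query.
response = F.conv2d(visual_feat, dyn_filter)
print(response.shape)  # torch.Size([1, 1, 32, 32])
```
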
“…Indeed, in [3], object detection with NLU evolved into instance segmentation using referring expressions. We review the state-of-the-art on the task of segmentation based on natural language expressions [3][4][5], highlighting the main contributions in the fusion of multimodal information, and then compare them against our approach.…”
Section: Related Work
confidence: 99%
“…Tracking by Natural Language Specification [5]. In this paper, the main task is object tracking in video sequences.…”
Section: Recurrent Multimodal Interaction
confidence: 99%
“…In this paper, we propose a new representation of videos that, as in the first examples, encodes the data in a general and contentagnostic manner, resulting in a long-term, robust motion representation applicable not only to action recognition, but to other video analysis tasks as well [39], [64]. This new representation distills the motion information contained in all the frames of a video into a single image, which we call the dynamic image.…”
Section: Introduction
confidence: 99%
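
The excerpt above describes collapsing the motion information of an entire video into a single "dynamic image". A minimal sketch of this idea is given below, combining frames with time-dependent weights in the spirit of approximate rank pooling; the coefficient formula and function name are assumptions for illustration, not necessarily the cited paper's exact derivation.

```python
import numpy as np

def dynamic_image(frames):
    """Collapse a list of T frames (H x W x C arrays) into one image
    by summing them with time-dependent weights (approximate rank
    pooling style coefficients, assumed here for illustration)."""
    T = len(frames)
    # Harmonic numbers H_t = sum_{i=1}^{t} 1/i, with H_0 = 0.
    H = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))
    out = np.zeros_like(frames[0], dtype=np.float64)
    for t in range(1, T + 1):
        alpha = 2 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])
        out += alpha * frames[t - 1]
    return out

# Example: ten random "frames" distilled into a single image.
video = [np.random.rand(64, 64, 3) for _ in range(10)]
img = dynamic_image(video)
print(img.shape)  # (64, 64, 3)
```
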