2022
DOI: 10.48550/arxiv.2210.02953
Preprint

Video Referring Expression Comprehension via Transformer with Content-aware Query

Abstract: Video Referring Expression Comprehension (REC) aims to localize a target object in video frames referred to by a natural language expression. Recently, Transformer-based methods have greatly boosted the performance limit. However, we argue that the current query design is suboptimal and suffers from two drawbacks: 1) a slow training convergence process; 2) a lack of fine-grained alignment. To alleviate this, we aim to couple the pure learnable queries with the content information. Specifically, we set up …

Cited by 0 publications
References 57 publications