2021
DOI: 10.48550/arxiv.2104.07993
Preprint

Self-supervised Video Retrieval Transformer Network

Xiangteng He,
Yulin Pan,
Mingqian Tang
et al.

Abstract: Content-based video retrieval aims to find videos in a large video database that are similar to, or even near-duplicates of, a given query video. It plays an important role in many video-related applications, including copyright protection, recommendation, and filtering. Video representation and similarity search algorithms are crucial to any video retrieval system. To derive effective video representations, most video retrieval systems require a large amount of manually annotated data for training, making i…
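
To make the abstract's notion of similarity search concrete, here is a minimal sketch, not the paper's implementation: retrieving the nearest database videos to a query by cosine similarity over precomputed video-level embeddings. The `cosine_topk` helper, the 512-dimensional embeddings, and the random stand-in data are assumptions for illustration only.

```python
# Minimal sketch (not the paper's method): nearest-neighbor video retrieval
# by cosine similarity over precomputed video-level embeddings.
import numpy as np

def cosine_topk(query_emb: np.ndarray, db_embs: np.ndarray, k: int = 5):
    """Return indices and scores of the k database videos closest to the query."""
    q = query_emb / (np.linalg.norm(query_emb) + 1e-12)
    d = db_embs / (np.linalg.norm(db_embs, axis=1, keepdims=True) + 1e-12)
    scores = d @ q                      # cosine similarity per database video
    top = np.argsort(-scores)[:k]       # highest similarity first
    return top, scores[top]

# Usage with random stand-in embeddings (dimension 512 is an arbitrary choice):
db = np.random.randn(1000, 512).astype(np.float32)
query = np.random.randn(512).astype(np.float32)
idx, sim = cosine_topk(query, db, k=5)
print(idx, sim)
```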


Cited by 2 publications (9 citation statements); References 28 publications
“…For efficient video similarity measurement, most retrieval methods follow a straightforward motivation: aggregating local frame-level features into clip-level or even video-level representations, such as global vectors [15,25], hash codes [8,9,19], and Bag-of-Words (BoW) [2,14,16], so that video similarity is measured by distances between the aggregated representations. However, the aggregated representations are too coarse to cover abundant fine-grained information and cannot be used for partial segment localization.…”
Section: Related Work
confidence: 99%
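
The aggregate-then-compare scheme this statement describes can be illustrated with a short sketch (hypothetical helper names, not SVRTN's or any cited method's code): frame-level features are mean-pooled into one global vector per video, so similarity collapses to a single distance, and the per-segment detail the statement calls fine-grained information is discarded.

```python
# Sketch of the "aggregate then compare" scheme described above
# (hypothetical helpers, for illustration only).
import numpy as np

def video_level_embedding(frame_feats: np.ndarray) -> np.ndarray:
    """Average-pool T x D frame features into one L2-normalized D-dim vector."""
    v = frame_feats.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-12)

def video_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Euclidean distance between the two aggregated video representations."""
    a = video_level_embedding(feats_a)
    b = video_level_embedding(feats_b)
    return float(np.linalg.norm(a - b))

# Two videos of different lengths (40 vs. 60 frames, 512-dim features):
a = np.random.randn(40, 512).astype(np.float32)
b = np.random.randn(60, 512).astype(np.float32)
print(video_distance(a, b))
```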
“…However, the aggregated representations are too coarse to cover abundant fine-grained information and cannot be used for partial segment localization. Therefore, in this paper, we adopt the frame-level representation of SVRTN [8] to implement frame encoding, keeping the fine-grained information for partial segment localization.…”
Section: Related Work
confidence: 99%
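
In contrast to the aggregated scheme sketched above, the following sketch (hypothetical, not SVRTN's actual localization algorithm) shows why keeping frame-level features enables partial segment localization: a frame-to-frame similarity matrix lets one slide the query over the reference and score candidate windows, which a single pooled vector cannot support. The function names and the diagonal window-scoring rule are assumptions for illustration.

```python
# Sketch: frame-level matching for partial segment localization
# (illustrative only; not the algorithm used by SVRTN or the citing paper).
import numpy as np

def frame_similarity_matrix(query_feats: np.ndarray, ref_feats: np.ndarray) -> np.ndarray:
    """Cosine similarity between every query frame and every reference frame."""
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + 1e-12)
    r = ref_feats / (np.linalg.norm(ref_feats, axis=1, keepdims=True) + 1e-12)
    return q @ r.T                              # shape: (T_query, T_ref)

def localize_segment(query_feats: np.ndarray, ref_feats: np.ndarray):
    """Slide the query over the reference and return the best-aligned window."""
    sim = frame_similarity_matrix(query_feats, ref_feats)
    t_q, t_r = sim.shape
    best_start, best_score = 0, -np.inf
    for start in range(t_r - t_q + 1):
        # Score a window by the mean similarity along its frame-to-frame diagonal.
        score = np.trace(sim[:, start:start + t_q]) / t_q
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_start + t_q, float(best_score)

# Usage with random stand-in frame features (20 query frames, 200 reference frames):
query = np.random.randn(20, 512).astype(np.float32)
reference = np.random.randn(200, 512).astype(np.float32)
print(localize_segment(query, reference))
```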