2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00138

Spatio-Temporal Video Re-Localization by Warp LSTM

Abstract: The need for efficiently finding the video content a user wants is increasing because of the eruption of user-generated videos on the Web. Existing keyword-based or content-based video retrieval methods usually determine what occurs in a video but not when and where. In this paper, we answer the question of when and where by formulating a new task, namely spatio-temporal video re-localization. Specifically, given a query video and a reference video, spatio-temporal video re-localization aims to local…
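To make the task's input/output contract concrete, here is a minimal toy sketch in Python. It is not the paper's Warp LSTM model; the sliding-window cosine matching, the feature shapes, and the per-frame spatial peak are all illustrative assumptions. It slides the query clip over the reference video in time to select a segment, then marks the strongest spatial response in each selected frame.

```python
import numpy as np

def re_localize(query_feats, ref_feats):
    """Toy spatio-temporal re-localization over precomputed frame features.

    query_feats: (Tq, H, W, C) features for the query clip.
    ref_feats:   (Tr, H, W, C) features for the reference video, Tr >= Tq.
    Returns the best-matching (start, end) frame span and one (row, col)
    spatial peak per frame in that span.
    """
    Tq = query_feats.shape[0]
    q = query_feats.mean(axis=(0, 1, 2))                  # (C,) query descriptor
    q = q / (np.linalg.norm(q) + 1e-8)
    # Temporal step: score every length-Tq window by cosine similarity.
    scores = []
    for s in range(ref_feats.shape[0] - Tq + 1):
        w = ref_feats[s:s + Tq].mean(axis=(0, 1, 2))
        scores.append(w @ q / (np.linalg.norm(w) + 1e-8))
    start = int(np.argmax(scores))
    end = start + Tq
    # Spatial step: per-frame similarity map against the query descriptor.
    peaks = []
    for t in range(start, end):
        resp = ref_feats[t] @ q                           # (H, W) response map
        peaks.append(np.unravel_index(int(resp.argmax()), resp.shape))
    return (start, end), peaks

# Example with random features standing in for a real video backbone.
rng = np.random.default_rng(0)
span, peaks = re_localize(rng.normal(size=(8, 7, 7, 64)),
                          rng.normal(size=(40, 7, 7, 64)))
```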

Cited by 39 publications (35 citation statements)
References 47 publications
“…ing the task of video re-localization in diverse ways. For example, Feng et al (Feng et al 2019) not only localize the temporal boundary of an action clip but also seek the spatial bounding box of that action at each frame. Zhang et al (Zhang et al 2019b) try to use a single query image to localize the relevant action clip in a reference video.…”
Section: Related Work
Mentioning confidence: 99%
“…To benefit from the spatial context, spatial pooling divides a video using fixed segmentation grids and pools the features locally in each grid cell [22][23][24]. Though performance has improved, different action instances of the same category, with varying human locations in the spatio-temporal volume [46,47], can result in a non-uniform distribution of features. Furthermore, one interaction component may be divided into different cells due to the fixed segmentation.…”
Section: Mid-level Patch Mining Based on Motion Saliency
Mentioning confidence: 99%
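The fixed-grid pooling criticized in this snippet is simple to sketch. Below is a minimal illustration, assuming per-frame (H, W, C) feature maps, average pooling, and an arbitrary 4x4 grid (all assumptions, not taken from the cited works): each frame is cut into cells and each cell is pooled independently, which is exactly why a single interaction component can end up split across cell boundaries.

```python
import numpy as np

def grid_pool(frame_feats, grid=(4, 4)):
    """Fixed-segmentation-grid pooling: cut an (H, W, C) feature map into
    grid cells and average-pool each cell into one C-dim descriptor."""
    H, W, C = frame_feats.shape
    gh, gw = grid
    fh, fw = H - H % gh, W - W % gw              # trim so cells divide evenly
    cells = frame_feats[:fh, :fw].reshape(gh, fh // gh, gw, fw // gw, C)
    return cells.mean(axis=(1, 3)).reshape(gh * gw, C)   # (gh*gw, C)

# Example: pool a 14x14x256 feature map into 16 cell descriptors.
descriptors = grid_pool(np.random.default_rng(0).normal(size=(14, 14, 256)))
```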
“…However, to be effective they need to generate many proposals and a secondary supervised step to find the best fitting one. To avoid the need for extensive supervision, Feng et al [7] introduced one-shot localization of actions in time and space. They rely on proposals as well, but rather than using class supervision, a matching model between a trimmed support video and a long untrimmed video determines the best proposal.…”
Section: Introduction
Mentioning confidence: 99%
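The proposal-matching step described in this snippet can be sketched as follows. This is a hedged illustration only: the matching model in [7] is learned, whereas this sketch scores assumed precomputed embeddings with plain cosine similarity and returns the best-fitting proposal.

```python
import numpy as np

def best_proposal(support_emb, proposal_embs):
    """Pick the proposal whose embedding best matches the trimmed
    support clip, scored by cosine similarity.

    support_emb:   (C,) embedding of the trimmed support video.
    proposal_embs: (N, C) embeddings of N spatio-temporal proposals.
    """
    s = support_emb / (np.linalg.norm(support_emb) + 1e-8)
    p = proposal_embs / (np.linalg.norm(proposal_embs, axis=1, keepdims=True) + 1e-8)
    return int(np.argmax(p @ s))      # index of the best-fitting proposal
```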
“…upon both [7] and [48] and propose the new task of few-shot common action localization in time and space. Our approach does not require any box annotations or class labels to obtain the spatio-temporal localization, and neither do we need proposals as in [7,48]. All we require are a handful of trimmed videos showing a common unnamed action, see Figure 1.…”
Section: Introduction
Mentioning confidence: 99%