2020
DOI: 10.1609/aaai.v34i07.6763
Weakly-Supervised Video Re-Localization with Multiscale Attention Model

Abstract: Video re-localization aims to localize a sub-sequence, called target segment, in an untrimmed reference video that is similar to a given query video. In this work, we propose an attention-based model to accomplish this task in a weakly supervised setting. Namely, we derive our CNN-based model without using the annotated locations of the target segments in reference videos. Our model contains three modules. First, it employs a pre-trained C3D network for feature extraction. Second, we design an attention mechan…
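The abstract describes scoring reference-video clips against a query at multiple temporal scales. The following is a minimal, hypothetical NumPy sketch of that idea, not the authors' implementation: clip features (e.g. from a pre-trained C3D network) are average-pooled over windows of several lengths, compared to a pooled query descriptor by cosine similarity, and turned into per-scale attention weights.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multiscale_attention(query_feats, ref_feats, scales=(1, 2, 4)):
    """Score each reference clip against the query at several temporal scales.

    query_feats: (Tq, D) clip-level features (e.g. from a pre-trained C3D)
    ref_feats:   (Tr, D) reference-video clip features
    Returns a (Tr,) relevance score per reference clip, averaged over scales.
    This is an illustrative sketch; window sizes and pooling are assumptions.
    """
    q = query_feats.mean(axis=0)                      # pooled query descriptor (D,)
    q = q / (np.linalg.norm(q) + 1e-8)
    scores = np.zeros(len(ref_feats))
    for s in scales:
        # average-pool reference features over a trailing window of length s
        pooled = np.stack([
            ref_feats[max(0, t - s + 1): t + 1].mean(axis=0)
            for t in range(len(ref_feats))
        ])
        pooled = pooled / (np.linalg.norm(pooled, axis=1, keepdims=True) + 1e-8)
        sim = pooled @ q                              # cosine similarity per clip
        scores += softmax(sim)                        # attention weights at this scale
    return scores / len(scales)
```

The averaged scores form a distribution over reference clips; in a weakly supervised setting, such a distribution can be trained from video-level similarity alone, without annotated segment boundaries.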

Cited by 9 publications (6 citation statements)
References 27 publications
“…In a recently proposed line of research, video re-localization, Feng et al [9] propose to localize, in a reference video, segments that correspond semantically to a given query video. Huang et al [15] extend the original formulation to learn without temporal boundary information in the training set by using a multiscale attention module. Besides, Yang et al [52] assume only one class in each query video and more than one support video.…”
Section: Few-shot Learning
confidence: 99%
“…Traditional fully-supervised deep learning methods typically require large amounts of annotated data, introducing a significant, ambiguity-prone annotation workload [36,37,47,55]. For this reason, learning with scarce data (i.e., few-shot learning) has received increasing attention in domains such as object detection [8,14,28,42-44], action recognition [1,2,4,13,54,59], and action localization [9,15,51,52]. Current works in this domain either learn from trimmed [1,2,4,22,58,60] or well-annotated untrimmed videos [51], or address class-agnostic localization tasks [9,15,52]; learning with both scarce data and limited annotation for both action recognition and localization is still an under-explored area.…”
Section: Related Work
confidence: 99%
“…Chen et al [15] proposed a spatial-temporal attention-aware model (STAL) to handle the salient features of pedestrian video in the spatial and temporal dimensions; this method mainly computes scores over spatial and temporal feature information to identify salient regions. Huang et al [16] proposed a CNN model based on a co-attention mechanism, which estimates the similarity between video actions and extracts multi-scale temporal features. Shu et al [17] proposed a new Skeleton-joint Co-Attention (SCA) mechanism on RNNs, which learns skeleton-joint co-attention feature maps from sequences for subsequent human action recognition.…”
Section: Deep Learning of Action Feature Presentation Based on Attention Mechanism
confidence: 99%
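The co-attention idea referenced in the statement above can be sketched in a few lines. This is a hypothetical toy example, not the code of any cited paper: an affinity matrix between two feature sequences is normalized in both directions, so each sequence attends over the other.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(query_feats, ref_feats):
    """Toy co-attention between two clip-feature sequences.

    query_feats: (Tq, D), ref_feats: (Tr, D).
    Returns attended features of shapes (Tq, D) and (Tr, D).
    """
    # affinity between every query/reference clip pair
    A = query_feats @ ref_feats.T                 # (Tq, Tr)
    attn_q2r = softmax(A, axis=1)                 # query attends over reference
    attn_r2q = softmax(A.T, axis=1)               # reference attends over query
    query_attended = attn_q2r @ ref_feats         # (Tq, D)
    ref_attended = attn_r2q @ query_feats         # (Tr, D)
    return query_attended, ref_attended
```

Each attended feature is a convex combination of the other sequence's clips, which is what lets similarity between the two videos be estimated without frame-level labels.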