2020
DOI: 10.1609/aaai.v34i07.6627

Rethinking the Bottom-Up Framework for Query-Based Video Localization

Abstract: In this paper, we focus on the task of query-based video localization, i.e., localizing a query in a long and untrimmed video. The prevailing solutions for this problem can be grouped into two categories: i) Top-down approach: it pre-cuts the video into a set of moment candidates, then performs classification and regression for each candidate; ii) Bottom-up approach: it injects the whole query content into each video frame, then predicts the probabilities of each frame being a ground-truth segment boundary (i.e., …
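The bottom-up idea described in the abstract can be sketched minimally: fuse the query into every frame, then score each frame as a potential start or end boundary. The element-wise fusion and the projection vectors `w_start` / `w_end` below are illustrative assumptions, not the model proposed in the paper:

```python
import numpy as np

def bottom_up_localize(frame_feats, query_feat, w_start, w_end):
    """Generic bottom-up localizer sketch (illustrative, not the paper's model).

    frame_feats: (T, D) per-frame features; query_feat: (D,) query embedding.
    w_start, w_end: (D,) hypothetical projection vectors, one per boundary.
    Returns (start, end) frame indices with start <= end.
    """
    # Inject the whole query content into each frame via element-wise fusion.
    fused = frame_feats * query_feat                      # (T, D)

    # Per-frame boundary probabilities via a softmax over the frame axis.
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    p_start = softmax(fused @ w_start)                    # (T,)
    p_end = softmax(fused @ w_end)                        # (T,)

    # Greedy decoding: pick the most likely start, then the most likely
    # end at or after it, so the segment is well-formed.
    start = int(np.argmax(p_start))
    end = start + int(np.argmax(p_end[start:]))
    return start, end
```

The constraint `start <= end` is enforced at decoding time; real bottom-up models typically decode the two probability sequences jointly.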

Cited by 162 publications (109 citation statements) · References 37 publications (49 reference statements)
“…In terms of their context modeling, most approaches [12], [16], [17], [20]- [22], [25] gradually aggregate the context information through a recurrent structure. Some approaches [18], [23], [29] model surrounding clips as the local context using 1D convolution layers, while other approaches model the entire clip as the global context through self-attention modules [19], [24], [26], [27]. Since clips are the shortest moments, the clip-level context is a subset of moment-level context.…”
Section: Related Work
confidence: 99%
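The local-versus-global distinction drawn in this snippet can be illustrated with two minimal stand-ins: a windowed average over neighbouring clips in place of a learned 1D convolution, and unparameterized dot-product self-attention over all clips. Both are assumptions for illustration, not any of the cited models:

```python
import numpy as np

def local_context(clip_feats, window=3):
    """Local context: mix each clip with its neighbours, a stand-in for a
    1D convolution over the clip axis (no learned kernel)."""
    T, _ = clip_feats.shape
    pad = window // 2
    padded = np.pad(clip_feats, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([padded[t:t + window].mean(axis=0) for t in range(T)])

def global_context(clip_feats):
    """Global context: every clip attends to every other clip via
    single-head dot-product self-attention (no learned projections)."""
    d = clip_feats.shape[1]
    scores = clip_feats @ clip_feats.T / np.sqrt(d)       # (T, T)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)         # rows sum to 1
    return weights @ clip_feats                           # (T, D)
```

On a constant sequence both operators act as the identity; on varying input, the local operator only ever sees `window` clips, while the attention operator mixes information from the entire sequence.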
“…Though some of them use an additional regression layer to predict the offsets, their candidate-level features are not suitable for boundary-level regression, which results in inferior performance. On the other hand, by comparing our method with frame-based bottom-up approaches (DEBUG [27], TGN [4], CBP [36], GDP [6]), we can observe that our method works better. Since these approaches only use frame-level representations for moment localization, the boundary features are unaware of the moment content they constitute and lack consistency, which results in poor performance.…”
Section: Performance Comparison
confidence: 84%
“…The authors of [27] make full use of positive samples to alleviate the severe imbalance problem. The authors of [6] use a Graph-FPN layer to encode scene relationships and semantics.…”
Section: Moment Localization By Language
confidence: 99%
“…Zhang et al. [39] design a 2D temporal map to capture the temporal relations between adjacent moments. Different from the top-down formulation, the bottom-up framework [5,6] is designed to directly predict the probabilities of each frame being a target boundary. Further, He et al. and Wang et al. [14,33] formulate this task as a sequential decision-making problem and apply reinforcement learning to progressively regulate the temporal boundaries.…”
Section: Video Moment Retrieval
confidence: 99%