Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.167

Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network

Abstract: Temporal sentence localization in videos aims to ground the best matched segment in an untrimmed video according to a given sentence query. Previous works in this field mainly rely on single-step attentional frameworks to align the temporal boundaries by a soft selection. Although they focus on the visual content relevant to the query, these attention strategies are insufficient to model complex video contents and restrict the higher-level reasoning demand for temporal relation. In this paper, we propose a nov…

Cited by 19 publications (11 citation statements)
References 30 publications
“…Subsequent work generally follows the strategies of TGN or SCDM with more sophisticated learning modules and/or auxiliary objectives. To be specific, CMIN [50], [78], CBP [79], FIAN [80], HDRR [81], and MIGCN [82] adopt the strategy of TGN, while CSMGAN [83], RMN [84], IA-Net [85], and DCT-Net [86] apply the strategy of SCDM. These solutions design various cross-modal reasoning strategies to perform more fine-grained and deeper multi-modal interaction between video and query, for precise moment localization.…”
Section: Anchor-based Methods (mentioning)
confidence: 99%
“…Therefore, the main challenge in such setting is how to align multi-modal features well to predict precise boundary. Some works (Qu et al 2020; Liu, Qu, and Zhou 2021; Liu et al 2021a, 2020a, 2022b) integrate sentence information with each fine-grained video clip unit, and predict the scores of candidate segments by gradually merging the fusion feature sequence over time. Although these methods achieve good performances, they severely rely on the quality of the proposals and are time-consuming.…”
Section: Related Work (mentioning)
confidence: 99%
“…Therefore, the main challenge in such setting is how to align multi-modal features well to predict precise boundary. Some works (Qu et al 2020; Liu, Qu, and Zhou 2021; Liu et al 2021a, 2020a, 2022b) integrate sentence information with each fine-grained video clip unit, and predict the scores of candidate segments by gradually merging the fusion feature sequence over time. Although these methods achieve good performances, they severely rely on the quality of the proposals and are time-consuming.…”
Section: Language-based Semantic Mining (mentioning)
confidence: 99%