2021
DOI: 10.48550/arxiv.2109.08039
Preprint

A Survey on Temporal Sentence Grounding in Videos

Abstract: Temporal sentence grounding in videos (TSGV), which aims to localize one target segment from an untrimmed video with respect to a given sentence query, has drawn increasing attention in the research community over the past few years. Different from the task of temporal action localization, TSGV is more flexible since it can locate complicated activities via natural language, without restrictions from predefined action categories. Meanwhile, TSGV is more challenging since it requires both textual and visual u…

Cited by 2 publications (2 citation statements)
References 79 publications (153 reference statements)

“…Figure 1 shows that the current datasets comprise relatively short videos, containing single structured scenes, and language descriptions that cover most of the video. Furthermore, the temporal anchors for the language are temporally biased, leading to methods not learning from any visual features and eventually overfitting to temporal priors for specific actions, thus limiting their generalization capabilities [9,18].…”
Section: Introduction
confidence: 99%
“…Figure 1 shows that the current datasets comprise relatively short videos, containing single structured scenes, and language descriptions that cover most of the video. Furthermore, the temporal anchors for the language are temporally biased (refer to Figure 3), leading to methods not learning from any visual features and eventually overfitting to temporal priors for specific actions, thus limiting their generalization capabilities [7,16].…”
Section: Introduction
confidence: 99%