2022
DOI: 10.48550/arxiv.2201.08071
Preprint

The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions

Abstract: Temporal sentence grounding in videos (TSGV), a.k.a., natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that semantically corresponds to a language query from an untrimmed video. Connecting computer vision and natural language, TSGV has drawn significant attention from researchers in both communities. This survey attempts to provide a summary of fundamental concepts in TSGV and current research status, as well as future research directions. As the ba…

Cited by 3 publications (6 citation statements)
References 140 publications (217 reference statements)
“…The temporal sentence grounding in general video (TSGV) is a critical task for cross-modal understanding [11]. The general video for the TSGV task refers to the video source collected from various domains including cooking [12], opening [13], indoors [14], and movies [9].…”
Section: Temporal Sentence Grounding In General Video
Citation type: mentioning
confidence: 99%
“…The temporal sentence grounding in the video (TSGV) is a critical task for cross-modal understanding [8,18]. This task takes a video-query pair as input where the video is a collection of consecutive image frames and the query is a sequence of words.…”
Section: Related Work, 2.1 Temporal Sentence Grounding In Video
Citation type: mentioning
confidence: 99%
“…This frame timeline can be translated into the subtitle span stamp, which locates in spans 8 and 9. The predicted start index shown in the Figure 2 is located in the $P_8^{start}$, while the predicted end index locates in the $P_9^{end}$. So the corresponding aligned subtitle stamp can be used as the final results (14.91 ~ 19.21).…”
Section: Subtitle Span Prediction
Citation type: mentioning
confidence: 99%
“…In recent years, we have witnessed great progress on temporal video grounding (TVG) [30,74]. One key to this success comes from the fine-grained dense 3D visual features extracted by 3D convolutional neural networks (CNNs) (e.g., C3D [56] and I3D [3]) since TVG tasks demand spatial-temporal context to locate the temporal interval of the moments described by the text query.…”
Section: Introduction
Citation type: mentioning
confidence: 99%