The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions

Zhang, Hao; Sun, Aixin; Wei, Jing; Zhou, Joey Tianyi

doi:10.48550/arxiv.2201.08071

Cited by 3 publications

(6 citation statements)

References 140 publications

(217 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The temporal sentence grounding in general video (TSGV) is a critical task for cross-modal understanding [11]. The general video for the TSGV task refers to the video source collected from various domains including cooking [12], opening [13], indoors [14], and movies [9].…”

Section: Temporal Sentence Grounding In General Videomentioning

confidence: 99%

Towards Visual-Prompt Temporal Answer Grounding in Instructional Video

Li¹,

Li²,

Sun³

et al. 2023

Preprint

View full text Add to dashboard Cite

<p>The temporal answer grounding in instructional video (TAGV) is a new task naturally derived from temporal sentence grounding in general video (TSGV). Given an untrimmed instructional video and a text question, this task aims at locating the frame span from the video that can semantically answer the question. Existing methods tend to formulate the TAGV task with a visual span-based predictor by matching the video frame span queried by the text question. However, due to the weak correlations of the semantic features between the textual question and visual answer, existing methods adopting the visual span-based predictor perform poorly in the TAGV task. In this paper, we propose a visual-prompt text span localizing (VPTSL) method, which introduces the timestamped subtitles to perform the text span localization. Specifically, we design the text span-based predictor, where the input text question, video subtitles, and visual prompt features are jointly learned with the pre-trained language model for enhancing the joint semantic representations. As a result, the TAGV task is reformulated with the visual-prompt subtitle span prediction matching the visual answer. Extensive experiments on three instructional video datasets, namely MedVidQA, TutorialVQA, and VehicleVQA, show that the proposed method outperforms several state-of-the-art (SOTA) methods by a large margin in terms of mIoU score, which demonstrates the effectiveness of the proposed visual prompt and text span-based predictor. Besides, all the experimental codes and datasets are open-sourced on the website https://github.com/wengsyx/VPTSL.</p>

show abstract

Section: Temporal Sentence Grounding In General Videomentioning

confidence: 99%

Towards Visual-Prompt Temporal Answer Grounding in Instructional Video

Li¹,

Li²,

Sun³

et al. 2023

Preprint

View full text Add to dashboard Cite

show abstract

“…The temporal sentence grounding in the video (TSGV) is a critical task for cross-modal understanding [8,18]. This task takes a video-query pair as input where the video is a collection of consecutive image frames and the query is a sequence of words.…”

Section: Related Work 21 Temporal Sentence Grounding In Videomentioning

confidence: 99%

“…This frame timeline can be translated into the subtitle span stamp, which locates in spans 8 and 9. The predicted start index shown in the Figure 2 is located in the 𝑃 8 𝑠𝑡𝑎𝑟𝑡 , while the predicted end index locates in the 𝑃 9 𝑒𝑛𝑑 . So the corresponding aligned subtitle stamp can be used as the final results (14.91 ~19.21).…”

Section: Subtitle Span Predictionmentioning

confidence: 99%

“…The goal of the TAGV task is to find the matching video answer span corresponding to its question, aka., visual answering localization. As the natural development from temporal sentence grounding in the video (TSGV) [7,8], the TAGV task is challenging since there are huge gaps between two different modalities. The text is discontinuous in syntactic structure, while the video is continuous within adjacent frames [9].…”

mentioning

confidence: 99%

See 1 more Smart Citation

Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video

Li¹,

Weng²,

Sun³

et al. 2022

Preprint

View full text Add to dashboard Cite

The temporal answering grounding in the video (TAGV) is a new task naturally derived from temporal sentence grounding in the video (TSGV). Given an untrimmed video and a text question, this task aims at locating the matching span from the video that can semantically answer the question. Existing methods tend to formulate the TAGV task with a visual span-based question answering (QA) approach by matching the visual frame span queried by the text question. However, due to the weak correlations and huge gaps of the semantic features between the textual question and visual answer, existing methods adopting visual span predictor perform poorly in the TAGV task. To bridge these gaps, we propose a visualprompt text span localizing (VPTSL) method, which introduces the timestamped subtitles as a passage to perform the text span localization for the input text question, and prompts the visual highlight features into the pre-trained language model (PLM) for enhancing the joint semantic representations. Specifically, the context query attention is utilized to perform cross-modal interaction between the extracted textual and visual features. Then, the highlight features are obtained through the video-text highlighting for the visual prompt. To alleviate semantic differences between textual and visual features, we design the text span predictor by encoding the question, the subtitles, and the prompted visual highlight features with the PLM. As a result, the TAGV task is formulated to predict the span of subtitles matching the visual answer. Extensive experiments on the medical instructional dataset, namely MedVidQA, show that the proposed VPTSL outperforms the state-of-the-art (SOTA) method by 28.36% in terms of mIOU with a large margin, which demonstrates the effectiveness of the proposed visual prompt and the text span predictor. CCS CONCEPTS• Information systems → Video search; • Computing methodologies → Neural networks.

show abstract

“…In recent years, we have witnessed great progress on temporal video grounding (TVG) [30,74]. One key to this success comes from the fine-grained dense 3D visual features extracted by 3D convolutional neural networks (CNNs) (e.g., C3D [56] and I3D [3]) since TVG tasks demand spatial-temporal context to locate the temporal interval of the moments described by the text query.…”

Section: Introductionmentioning

confidence: 99%

Text-Visual Prompting for Efficient 2D Temporal Video Grounding

Zhang¹,

Chen²,

Jia³

et al. 2023

Preprint

View full text Add to dashboard Cite

In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, the TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming, which calls for intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call 'prompts') into both visual inputs and textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train vision encoder and language encoder in a 2D TVG model and improves the performance of crossmodal feature fusion using only low-complexity sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and Ac-tivityNet Captions datasets, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5× inference acceleration over TVG using 3D visual features. Codes are available at Open.Intel.

show abstract

The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions

Cited by 3 publications

References 140 publications

Towards Visual-Prompt Temporal Answer Grounding in Instructional Video

Towards Visual-Prompt Temporal Answer Grounding in Instructional Video

Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video

Text-Visual Prompting for Efficient 2D Temporal Video Grounding

Contact Info

Product

Resources

About