Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding

Wang, Yuechen; Zhou, Wengang; Li, Houqiang

doi:10.18653/v1/2021.findings-emnlp.9

Cited by 16 publications

(7 citation statements)

References 42 publications

“…Many weakly supervised approaches leverage contrastive learning to improve visual-textual alignment (Zhang et al 2020, 2021; Ma et al 2020). Recent work employs graph-based methodologies to capture contextual relationships between frames (Tan et al 2021) and iterative approaches for fine-grained alignment between individual query tokens and video frames (Wang, Zhou, and Li 2021).…”

Section: Weakly Supervised and Zero-shot NLVL Methods (mentioning)

confidence: 99%

Commonsense for Zero-Shot Natural Language Video Localization

Holla, Lourentzou

2024

AAAI

Zero-shot Natural Language-Video Localization (NLVL) methods have exhibited promising results in training NLVL models exclusively with raw video data by dynamically generating video segments and pseudo-query annotations. However, existing pseudo-queries often lack grounding in the source video, resulting in unstructured and disjointed content. In this paper, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense to bridge the gap between videos and generated pseudo-queries via a commonsense enhancement module. CORONET employs Graph Convolution Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query representations prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that CORONET surpasses both zero-shot and weakly supervised baselines, achieving improvements up to 32.13% across various recall thresholds and up to 6.33% in mIoU. These results underscore the significance of leveraging commonsense reasoning for zero-shot NLVL.
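The abstract reports gains in recall at several thresholds and in mIoU without defining them. As a minimal sketch, assuming the conventional definitions used in the NLVL literature (temporal intersection-over-union between one predicted and one ground-truth segment per query), the metrics could be computed as follows. The function names and the `(start, end)` tuple layout are illustrative assumptions, not CORONET's evaluation code.

```python
from typing import List, Tuple

# A segment is (start_seconds, end_seconds); this layout is an assumption.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: List[Span], gts: List[Span], threshold: float) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def mean_iou(preds: List[Span], gts: List[Span]) -> float:
    """Average temporal IoU over all queries (mIoU)."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


if __name__ == "__main__":
    preds = [(0.0, 10.0), (0.0, 10.0)]
    gts = [(5.0, 15.0), (0.0, 10.0)]
    print(recall_at_iou(preds, gts, 0.5))
    print(mean_iou(preds, gts))
```

In this toy case the first prediction overlaps its target for 5 of 15 seconds and the second matches exactly, so only the second counts as a hit at a threshold of 0.5.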

“…BAR [145] involves additional RL module to progressively refine retrieved proposals. FSAN [149], [153], and LoGAN [154] focus on mining video and query contents and their correlations. Then they design fine-grained cross-modal alignment module for accurate moment localization.…”

Section: Multi-instance Learning Methods (mentioning)

confidence: 99%

The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions

Zhang, Sun, Wei et al. 2022

Preprint

Temporal sentence grounding in videos (TSGV), a.k.a., natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that semantically corresponds to a language query from an untrimmed video. Connecting computer vision and natural language, TSGV has drawn significant attention from researchers in both communities. This survey attempts to provide a summary of fundamental concepts in TSGV and current research status, as well as future research directions. As the background, we present a common structure of functional components in TSGV, in a tutorial style: from feature extraction from raw video and language query, to answer prediction of the target moment. Then we review the techniques for multimodal understanding and interaction, which is the key focus of TSGV for effective alignment between the two modalities. We construct a taxonomy of TSGV techniques and elaborate methods in different categories with their strengths and weaknesses. Lastly, we discuss issues with the current TSGV research and share our insights about promising research directions.

“…As an emerging and challenging cross-modal task, video moment retrieval using language (VMR) (Anne Hendricks et al 2017; Gao et al 2017) has drawn increasing attention in recent years due to its various applications, such as video understanding (Liu et al 2023h, 2020, 2021b, 2023b, 2022a, 2021a, 2023g,a, 2022c, 2023c,d, 2022b; Fang et al 2021a) and temporal action localization (Zhang et al 2020b; Fang et al 2022, 2023a; Ji et al 2023e, 2018, 2023g,f,d,c, 2021, 2020, 2019). As shown in Figure 1(a), the VMR task targets locating a video…”

Section: Introduction (mentioning)

confidence: 99%

Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language

Fang, Liu, Fang et al. 2024

AAAI

Given an untrimmed video and a sentence query, video moment retrieval using language (VMR) aims to locate a target query-relevant moment. Since the untrimmed video is overlong, almost all existing VMR methods first sparsely down-sample each untrimmed video into multiple fixed-length video clips and then conduct multi-modal interactions with the query feature and expensive clip features for reasoning, which is infeasible for long real-world videos that span hours. Since the video is downsampled into fixed-length clips, some query-related frames may be filtered out, which will blur the specific boundary of the target moment, take the adjacent irrelevant frames as new boundaries, easily leading to cross-modal misalignment and introducing both boundary-bias and reasoning-bias. To this end, in this paper, we propose an efficient approach, SpotVMR, to trim the query-relevant clip. Besides, our proposed SpotVMR can serve as plug-and-play module, which achieves efficiency for state-of-the-art VMR methods while maintaining good retrieval performance. Especially, we first design a novel clip search model that learns to identify promising video regions to search conditioned on the language query. Then, we introduce a set of low-cost semantic indexing features to capture the context of objects and interactions that suggest where to search the query-relevant moment. Also, the distillation loss is utilized to address the optimization issues arising from end-to-end joint training of the clip selector and VMR model. Extensive experiments on three challenging datasets demonstrate its effectiveness.
