Findings of the Association for Computational Linguistics: EMNLP 2021 (2021)
DOI: 10.18653/v1/2021.findings-emnlp.9

Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding

Abstract: Temporal language grounding (TLG) aims to localize a video segment in an untrimmed video based on a natural language description. To alleviate the expensive cost of manual annotations for temporal boundary labels, we are dedicated to the weakly supervised setting, where only video-level descriptions are provided for training. Most of the existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework. However, the temporal structure of the video…

Cited by 16 publications (7 citation statements)
References 42 publications
“…Many weakly supervised approaches leverage contrastive learning to improve visual-textual alignment (Zhang et al 2020, 2021; Ma et al 2020). Recent work employs graph-based methodologies to capture contextual relationships between frames (Tan et al 2021) and iterative approaches for fine-grained alignment between individual query tokens and video frames (Wang, Zhou, and Li 2021).…”
Section: Weakly Supervised and Zero-shot NLVL Methods (citation type: mentioning)
confidence: 99%
“…BAR [145] involves additional RL module to progressively refine retrieved proposals. FSAN [149], [153], and LoGAN [154] focus on mining video and query contents and their correlations. Then they design fine-grained cross-modal alignment module for accurate moment localization.…”
Section: Multi-instance Learning Methods (citation type: mentioning)
confidence: 99%
“…As an emerging and challenging cross-modal task, video moment retrieval using language (VMR) (Anne Hendricks et al 2017; Gao et al 2017) has drawn increasing attention in recent years due to its various applications, such as video understanding (Liu et al 2023h, 2020, 2021b, 2023b, 2022a, 2021a, 2023g,a, 2022c, 2023c,d, 2022b; Fang et al 2021a) and temporal action localization (Zhang et al 2020b; Fang et al 2022, 2023a; Ji et al 2023e, 2018, 2023g,f,d,c, 2021, 2020, 2019). As shown in Figure 1(a), the VMR task targets locating a video…”
Section: Introduction (citation type: mentioning)
confidence: 99%