2022
DOI: 10.1609/aaai.v36i3.20163
Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

Abstract: Temporal grounding aims to localize a video moment which is semantically aligned with a given natural language query. Existing methods typically apply a detection or regression pipeline on the fused representation, with the research focus on designing complicated prediction heads or fusion strategies. Instead, from a perspective on temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) to directly model the similarity between language queries and video moments in a joint embedding space.
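The metric-learning formulation described in the abstract can be sketched as a mutual (symmetric) contrastive objective over moment and query embeddings, where the other pairs in a batch serve as negatives. The sketch below is a minimal illustration under assumptions, not the authors' implementation: the toy embeddings, dimensions, and temperature are invented for demonstration.

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mutual_matching_loss(moment_emb, query_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss: each moment should retrieve its own
    query (moment-to-query) and each query its own moment (query-to-moment);
    all other pairs in the batch act as negatives."""
    sims = cosine_sim(moment_emb, query_emb) / temperature  # (N, N)
    n = sims.shape[0]

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    diag = np.arange(n)
    m2q = -log_softmax(sims, axis=1)[diag, diag]  # rows: moment -> query
    q2m = -log_softmax(sims, axis=0)[diag, diag]  # cols: query -> moment
    return float((m2q + q2m).mean() / 2)

# Toy batch: 4 moment/query pairs in a shared 8-d embedding space.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
moments = queries + 0.05 * rng.normal(size=(4, 8))  # near-aligned positives
loss_aligned = mutual_matching_loss(moments, queries)
loss_random = mutual_matching_loss(rng.normal(size=(4, 8)), queries)
print(loss_aligned, loss_random)  # aligned pairs should yield a lower loss
```

Because the loss is computed in both retrieval directions, every query doubles as a negative sample for every non-matching moment and vice versa, which is the sense in which negative samples drive the objective.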

Cited by 70 publications (47 citation statements)
References 78 publications
“…However, the guidance of the generation of subquery-level information is not fully guaranteed. MMN [26] trains the model to distinguish matched and unmatched video-sentence pairs collected both within and across videos, in order to exploit the relationship between positive and negative sentences. However, it ignores the relationship between phrases and video.…”
Section: Fully-Supervised Temporal Localization
confidence: 99%
“…Video grounding [2,11,14,29,42,45,52] is an important task in video understanding that aims to identify the timestamps semantically corresponding to a given query within untrimmed videos. It remains challenging since it must not only model complex cross-modal interactions but also capture comprehensive contextual information for semantic alignment.…”
Section: Introduction
confidence: 99%
“…(2) The semantic gap between DVC and video grounding datasets leads to errors between the generated dense captions and the ground truth. Specifically, we implement this data augmentation idea on two representative methods (i.e., MMN [42] and 2D-TAN [52]). The experimental results on the ActivityNet Captions dataset are shown in Fig.…”
Section: Introduction
confidence: 99%
“…It also devises a novel multi-stage boundary regression to refine the predicted moments. Instead of using the simple Hadamard product, DMN [87] proposes to project proposal and query features into a common embedding space and leverage metric learning for cross-modal pair discrimination.…”
Section: 2D-Map
confidence: 99%
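The contrast drawn in the excerpt above, between Hadamard-product fusion and scoring pairs directly in a common embedding space, can be illustrated schematically. The projection matrices and layer widths below are illustrative assumptions, not the cited model's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d_video, d_text, d_joint = 16, 12, 8

# Modality-specific features for one moment proposal and one query.
proposal = rng.normal(size=d_video)
query = rng.normal(size=d_text)

# Linear projections into a shared d_joint-dimensional space.
W_v = rng.normal(size=(d_joint, d_video))
W_t = rng.normal(size=(d_joint, d_text))

# (a) Fusion route: Hadamard product of the projected features yields a
# fused vector that a learned prediction head would then score.
fused = (W_v @ proposal) * (W_t @ query)

# (b) Metric-learning route: score the pair directly by cosine similarity
# in the joint space, with no fusion or prediction head in between.
v, t = W_v @ proposal, W_t @ query
score = float(v @ t / (np.linalg.norm(v) * np.linalg.norm(t)))
print(fused.shape, score)
```

The design difference matters at inference time: with route (b), all moment embeddings can be precomputed and a new query scored against them with a single matrix multiply, whereas route (a) must run the prediction head on every fused pair.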