2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00134

MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment

Abstract: This research strives for natural language moment retrieval in long, untrimmed video streams. The problem is not trivial, especially when a video contains multiple moments of interest and the language describes complex temporal dependencies, which often happen in real scenarios. We identify two crucial challenges: semantic misalignment and structural misalignment. However, existing approaches treat different moments separately and do not explicitly model complex moment-wise temporal relations. In this paper, …

Cited by 308 publications (243 citation statements) · References 52 publications

“…Vision-and-Language Grounding Recently, researchers in both computer vision and natural language processing are striving to bridge vision and natural language towards a deeper understanding of the world [53,47,22,7,19,43,21], e.g., captioning an image or a video with natural language [11,12,46,48,54,55,49] or localizing desired objects within an image given a natural language description [37,20,56,57]. Moreover, visual question answering [4] and visual dialog [9] aim to generate one-turn or multi-turn response by grounding it on both visual and textual modalities.…”
Section: Related Work (mentioning)
confidence: 99%
“…They compare the similarity between candidate moments and the query sentence in a common embedding space using a ranking loss. The authors of [47] integrated candidate moment generation and temporal reasoning in a single-shot structure.…”
Section: Moment Localization by Language (mentioning)
confidence: 99%
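For intuition, the setup this excerpt describes — scoring candidate moments against the query in a shared embedding space with a ranking loss — can be sketched as follows. This is a minimal illustration, not the exact loss of any cited method; the function name, the choice of cosine similarity, and the margin value are assumptions:

```python
import numpy as np

def margin_ranking_loss(query_emb, pos_moment_emb, neg_moment_emb, margin=0.1):
    """Hinge-style ranking loss: the matching moment should score higher
    than a mismatched moment by at least `margin` in the shared space."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    s_pos = cos(query_emb, pos_moment_emb)   # similarity to the true moment
    s_neg = cos(query_emb, neg_moment_emb)   # similarity to a negative moment
    return max(0.0, margin - s_pos + s_neg)
```

In practice such a loss is summed over many sampled negative moments, pushing the true moment above all mismatched candidates in the embedding space.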
“…It is the percentage of queries for which at least one of the top-n scoring candidate moments has an Intersection over Union (IoU) with the ground truth larger than m. We report results for n ∈ {1, 5} with m ∈ {0.1, 0.3, 0.5} for TACoS, n ∈ {1, 5} with m ∈ {0.5, 0.7} for Charades-STA, and n ∈ {1, 5} with m ∈ {0.3, 0.5, 0.7} for Activity-Caption. We evaluate our proposed DPIN approach on three datasets and compare our model with the state-of-the-art methods, including candidate-based (top-down) approaches: CTRL [9], MCF [39], ACRN [24], SAP [7], CMIN [50], ACL [10], SCDM [43], ROLE [25], SLTA [16], MAN [47], Xu et al. [41], 2D-TAN [48]; and frame-based (bottom-up) approaches: ABLR [44], GDP [6], TGN [4], CBP [36], ExCL [11], DEBUG [27].…”
Section: Performance Comparison (mentioning)
confidence: 99%
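The R@n, IoU=m metric quoted above is simple to compute. Below is a minimal sketch; function and variable names are illustrative rather than taken from any cited paper's code, and moments are assumed to be (start, end) pairs in seconds:

```python
import numpy as np

def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) pairs."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_n_iou_m(predictions, ground_truths, n, m):
    """R@n, IoU=m: fraction of queries for which at least one of the
    top-n scoring candidate moments has IoU > m with the ground truth.
    predictions[i] is a score-sorted list of (start, end) candidates
    for query i; ground_truths[i] is the annotated (start, end) pair."""
    hits = sum(
        any(temporal_iou(p, gt) > m for p in preds[:n])
        for preds, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

# e.g. R@5 at IoU=0.5:  recall_at_n_iou_m(predictions, ground_truths, n=5, m=0.5)
```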
“…Text-Image Matching: Learning cross-modal embeddings has numerous applications [61,69], ranging from PINs based on facial and voice information [37] to generative feature learning [15] and domain adaptation [63,65]. Nagrani et al. [37] demonstrated that a joint representation can be learned from facial and voice information and introduced a curriculum learning strategy [3,45,46] to perform hard negative mining during training.…”
Section: Related Work (mentioning)
confidence: 99%
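The hard negative mining mentioned in this excerpt is commonly implemented by selecting, for each anchor, the most similar candidates that are not its true match. A minimal sketch under that assumption (names are illustrative; the actual curriculum schedule of Nagrani et al. is not reproduced here):

```python
import numpy as np

def mine_hard_negatives(anchor_emb, candidate_embs, positive_idx, k=1):
    """Return indices of the k candidates most similar to the anchor that
    are NOT its true match; such 'hard' negatives give a stronger training
    signal than random ones. A curriculum can increase k (or the hardness
    of the selected negatives) over the course of training."""
    sims = candidate_embs @ anchor_emb / (
        np.linalg.norm(candidate_embs, axis=1) * np.linalg.norm(anchor_emb) + 1e-8
    )
    sims[positive_idx] = -np.inf            # exclude the true match
    return np.argsort(sims)[::-1][:k]       # hardest negatives first
```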