Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval 2019
DOI: 10.1145/3331184.3331235

Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos

Abstract: Query-based moment retrieval aims to localize the most relevant moment in an untrimmed video according to a given natural language query. Existing works often focus on only one aspect of this emerging task, such as query representation learning, video context modeling, or multi-modal fusion, and thus fail to develop a comprehensive system for further performance improvement. In this paper, we introduce a novel Cross-Modal Interaction Network (CMIN) to consider multiple crucial factors for this challenging task…
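
As a rough illustration of the cross-modal interaction idea named in the abstract, below is a minimal sketch of attention between video frame features and query word features. This is not the paper's CMIN architecture; the module name, feature dimensions, and projection choices are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact CMIN architecture): cross-modal attention
# that lets each video frame feature attend to the query word features.
# Dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    def __init__(self, video_dim=500, query_dim=300, hidden_dim=256):
        super().__init__()
        self.q_proj = nn.Linear(video_dim, hidden_dim)   # frames act as attention queries
        self.k_proj = nn.Linear(query_dim, hidden_dim)   # words act as keys
        self.v_proj = nn.Linear(query_dim, hidden_dim)   # words act as values

    def forward(self, video_feats, query_feats):
        # video_feats: (batch, n_frames, video_dim); query_feats: (batch, n_words, query_dim)
        q = self.q_proj(video_feats)                                    # (B, T, H)
        k = self.k_proj(query_feats)                                    # (B, L, H)
        v = self.v_proj(query_feats)                                    # (B, L, H)
        scores = torch.bmm(q, k.transpose(1, 2)) / q.size(-1) ** 0.5    # (B, T, L)
        attn = F.softmax(scores, dim=-1)
        return torch.bmm(attn, v)                                       # query-aware frame features (B, T, H)

video = torch.randn(2, 128, 500)   # e.g. C3D features for 128 clips
query = torch.randn(2, 12, 300)    # e.g. GloVe embeddings for 12 words
fused = CrossModalAttention()(video, query)
print(fused.shape)                 # torch.Size([2, 128, 256])
```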

Cited by 214 publications (153 citation statements)
References 38 publications (77 reference statements)
“…Activity-Caption [18] was built on the ActivityNet v1.3 dataset [14] with diverse context. Following [48,50], we use val_1 as the validation set and val_2 as the testing set. We have 37,417, 17,505, and 17,031 moment-sentence pairs for training, validation, and testing, respectively.…”
Section: Datasets
confidence: 99%
“…It is the percentage of queries for which at least one of the candidate moments with top-n scores has an Intersection over Union (IoU) larger than m. We report results for n ∈ {1, 5} with m ∈ {0.1, 0.3, 0.5} for TACoS, n ∈ {1, 5} with m ∈ {0.5, 0.7} for Charades-STA, and n ∈ {1, 5} with m ∈ {0.3, 0.5, 0.7} for Activity-Caption, respectively. We evaluate our proposed DPIN approach on three datasets and compare our model with the state-of-the-art methods, including candidate-based (top-down) approaches: CTRL [9], MCF [39], ACRN [24], SAP [7], CMIN [50], ACL [10], SCDM [43], ROLE [25], SLTA [16], MAN [47], Xu et al. [41], 2D-TAN [48]; and frame-based (bottom-up) approaches: ABLR [44], GDP [6], TGN [4], CBP [36], ExCL [11], DEBUG [27].…”
Section: Performance Comparison
confidence: 99%
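
For clarity, here is a minimal sketch of how the "R@n, IoU=m" recall metric described in the excerpt above is typically computed. The function names and the example numbers are illustrative assumptions, not code from any of the cited papers.

```python
# R@n, IoU=m: the fraction of queries for which at least one of the top-n
# predicted moments has temporal IoU >= m with the ground-truth moment.
def temporal_iou(pred, gt):
    """IoU of two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_n_iou(all_top_preds, all_gts, n, m):
    """all_top_preds: one score-sorted list of (start, end) moments per query."""
    hits = sum(
        any(temporal_iou(p, gt) >= m for p in preds[:n])
        for preds, gt in zip(all_top_preds, all_gts)
    )
    return hits / len(all_gts)

# Example: one query, ground truth [5.0, 12.0], two ranked candidate moments.
print(recall_at_n_iou([[(4.0, 11.0), (20.0, 30.0)]], [(5.0, 12.0)], n=1, m=0.5))  # 1.0
```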
“…[Shou et al., 2017] employ temporal upsampling and spatial downsampling operations simultaneously. Furthermore, [Zhao et al., 2017] model the temporal structure of each action instance via a temporal pyramid. Other works skip the proposal generation step and directly detect action instances based on temporal convolutional layers.…”
Section: Related Work
confidence: 99%
“…As shown in Figure 1, the sentence describes multiple complicated events and corresponds to a temporal moment with complex object interactions. Recently, a large number of methods [4,12,15,33,40] have been proposed for this challenging task and have achieved satisfactory performance. However, most existing approaches are trained in the fully-supervised setting with temporal alignment annotations for each sentence.…”
Section: Introduction
confidence: 99%
“…As for the two-branch proposal module, the two branches have an identical structure and share all parameters. We first develop a conventional cross-modal interaction [4,40] between the language and frame sequences. Next, we apply a 2D moment map [39] to capture relationships between adjacent moments.…”
Section: Introduction
confidence: 99%
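
To make the "2D moment map" mentioned in the last excerpt more concrete, below is a minimal sketch in the spirit of 2D-TAN: entry (i, j) of the map represents the candidate moment spanning clips i through j. It is an illustrative reimplementation, not the cited authors' code; the mean-pooling choice and all names are assumptions.

```python
# Illustrative 2D temporal moment map: moment_map[i, j] is the feature of the
# candidate moment covering clips i..j (j >= i), built by mean-pooling clip
# features over that span; entries with j < i are marked invalid in the mask.
import torch

def build_2d_moment_map(clip_feats):
    """clip_feats: (n_clips, dim) -> map: (n_clips, n_clips, dim), mask: (n_clips, n_clips)."""
    n, d = clip_feats.shape
    # Prefix sums make mean-pooling over any span an O(1) lookup.
    prefix = torch.cat([torch.zeros(1, d), clip_feats.cumsum(dim=0)], dim=0)  # (n+1, d)
    moment_map = torch.zeros(n, n, d)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        for j in range(i, n):
            moment_map[i, j] = (prefix[j + 1] - prefix[i]) / (j - i + 1)
            mask[i, j] = True
    return moment_map, mask

clips = torch.randn(16, 256)              # e.g. 16 clip features of dimension 256
moments, valid = build_2d_moment_map(clips)
print(moments.shape, valid.sum().item())  # torch.Size([16, 16, 256]) 136
```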