2022
DOI: 10.1109/tmm.2021.3063631

Frame-Wise Cross-Modal Matching for Video Moment Retrieval

Cited by 50 publications (22 citation statements) · References 46 publications
“…Given a sentence query, their target is to localize an image region or a video moment in an image or video, respectively. Clearly, modeling pairwise relations between words in the query and capturing cross-modal interactions are also important for those tasks, so attention mechanisms [23], [25], [26] and graph neural networks [27], [28], [29] are also adopted in some visual grounding methods. For example, Chen et al. [30] proposed to explore the cross-modal interactions between the query and the video with a Match-LSTM structure for the temporal language grounding task; Liu et al. proposed the ROLE model [25], which employs a query-attention module to adaptively reweight the features of each word in the query according to the video content.…”
Section: B. Language Grounding in Visual Data
confidence: 99%
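
A minimal sketch of the query-attention idea described in the excerpt above: word features are reweighted by their relevance to a pooled video feature. This is an illustration in PyTorch under assumed shapes and layer choices (the `QueryAttention` name, the single linear scoring layer, and the mean-pooled video input are all assumptions), not the actual ROLE implementation from [25].

```python
import torch
import torch.nn as nn

class QueryAttention(nn.Module):
    """Hypothetical sketch: reweight each word by its relevance to the video."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # scores a (word, video) feature pair

    def forward(self, words: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # words: (B, L, D) word features; video: (B, D) pooled clip feature
        v = video.unsqueeze(1).expand(-1, words.size(1), -1)  # (B, L, D)
        logits = self.score(torch.cat([words, v], dim=-1))    # (B, L, 1)
        weights = torch.softmax(logits, dim=1)                # attention over the L words
        return weights * words                                # video-conditioned word features
```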
“…To overcome the drawbacks of anchor-based methods, several anchor-free schemes have been proposed. These methods [1, 46-50] usually treat a video holistically as a continuous sequence and process it with a sequence network such as [8, 51] to capture temporal dependencies. The interaction between the visual and linguistic sequences is then modeled with various attention operations.…”
Section: Review of Natural Language Video Localization
confidence: 99%
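
The anchor-free pattern this excerpt describes (encode the whole video as one frame sequence, then fuse it with the query through attention) could look roughly like the sketch below. The GRU encoder, the multi-head cross-attention, and all shapes are assumptions for illustration, not the design of any specific cited method.

```python
import torch
import torch.nn as nn

class CrossModalSequence(nn.Module):
    """Illustrative anchor-free fusion: sequence encoding + cross-attention."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.encoder = nn.GRU(dim, dim, batch_first=True)  # temporal dependency
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frames: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) frame features; query: (B, L, D) word features
        h, _ = self.encoder(frames)                  # encode the whole video holistically
        fused, _ = self.cross_attn(h, query, query)  # each frame attends to the query words
        return fused                                 # (B, T, D) query-aware frame features
```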
“…For example, [46, 47] first perform a binary classification for each frame in the sequence and then densely regress the distances to the moment boundaries for all positive frames. In contrast, [1, 48-50] directly predict three probability scores for each frame, indicating whether it lies in the foreground, at the start boundary, or at the end boundary, trained with two kinds of classification losses. These methods greatly reduce the redundant computation brought by candidate clips while making good use of the temporal dependencies in the video, achieving appealing performance.…”
Section: Review of Natural Language Video Localization
confidence: 99%
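
The two anchor-free head styles described above can be summarized in one small sketch: a binary foreground classifier with boundary-distance regression (the [46, 47] style) and a per-frame foreground/start/end scorer (the [1, 48-50] style). Layer sizes and names here are hypothetical; no cited paper's exact architecture or losses are reproduced.

```python
import torch
import torch.nn as nn

class FrameWiseHeads(nn.Module):
    """Hypothetical per-frame prediction heads for anchor-free localization."""
    def __init__(self, dim: int):
        super().__init__()
        self.foreground = nn.Linear(dim, 1)  # binary: is the frame inside the moment?
        self.offsets = nn.Linear(dim, 2)     # regress distances to start/end boundaries
        self.three_way = nn.Linear(dim, 3)   # foreground / start / end scores per frame

    def forward(self, feats: torch.Tensor):
        # feats: (B, T, D) query-aware frame features
        fg = torch.sigmoid(self.foreground(feats)).squeeze(-1)  # (B, T)
        dists = torch.relu(self.offsets(feats))                 # (B, T, 2), non-negative
        scores = torch.sigmoid(self.three_way(feats))           # (B, T, 3)
        return fg, dists, scores
```

At inference, one common anchor-free recipe is to pair high-scoring start and end frames, or add the regressed offsets to each positive frame's position, to form the predicted moment; the exact decoding rule varies across the cited methods.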