Frame-Wise Cross-Modal Matching for Video Moment Retrieval

Tang, Haoyu; Zhu, Jihua; Liu, Meng; Gao, Zan; Cheng, Zhiyong

doi:10.1109/tmm.2021.3063631

Cited by 50 publications

(22 citation statements)

References 46 publications


“…Given a sentence query, their target is to localize an image region or a video moment in an image or video, respectively. Obviously, modeling pairwise relations between words in queries and capturing cross-modal interactions are also important for those tasks, so the attention mechanism [23], [25], [26] and the graph neural networks [27], [28], [29] are also adopted in some visual grounding methods. For example, Chen et al [30] proposed to explore the cross-modal interactions between the query and video by a Match-LSTM structure for temporal language grounding task; Liu et al proposed a ROLE model [25] which employed the query attention module to adaptively reweight the features of each word in query according to the video content.…”

Section: B Language Grounding In Visual Data (mentioning)

confidence: 99%

Query-graph with Cross-gating Attention Model for Text-to-Audio Grounding

Tang, Wang, Zhu et al. 2021 (preprint, self-citation)

In this paper, we address the text-to-audio grounding issue, namely, grounding the segments of the sound event described by a natural language query in the untrimmed audio. This is a newly proposed but challenging audio-language task, since it requires to not only precisely localize all the on- and off-sets of the desired segments in the audio, but to perform comprehensive acoustic and linguistic understandings and reason the multimodal interactions between the audio and query. To tackle those problems, the existing method treats the query holistically as a single unit by a global query representation, which fails to highlight the keywords that contain rich semantics. Besides, this method has not fully exploited interactions between the query and audio. Moreover, since the audio and queries are arbitrary and variable in length, many meaningless parts of them are not filtered out in this method, which hinders the grounding of the desired segments. To this end, we propose a novel Query Graph with Cross-gating Attention (QGCA) model, which models the comprehensive relations between the words in query through a novel query graph. Besides, to capture the fine-grained interactions between audio and query, a cross-modal attention module that assigns higher weights to the keywords is introduced to generate the snippet-specific query representations. Finally, we also design a cross-gating module to emphasize the crucial parts as well as weaken the irrelevant ones in the audio and query. We extensively evaluate the proposed QGCA model on the public Audiogrounding dataset with significant improvements over several state-of-the-art methods. Moreover, further ablation study shows the consistent effectiveness of different modules in the proposed QGCA model.


“…To overcome the drawback of the anchor-based method, some anchor-free schemes are proposed. These methods [1, 46-50] usually organize a video as a continuous sequence holistically and process it with the sequence network like [8,51] to capture the temporal dependency. Then the interaction between the visual and linguistic sequences is usually applied in different attention operations.…”

Section: Review Of Natural Language Video Localization (mentioning)

confidence: 99%

“…For example, [46,47] first perform a binary classification for each frame in the sequence, then densely regress the distances from boundaries for all positive frames. And [1, 48-50] directly predict three probability scores for each frame being the foreground annotation, the start, and end boundaries with two kinds of classification losses. These methods greatly reduce the redundant computation bringing by candidate clips, and at the same time make good use of the temporal dependence in the video, which achieves appealing performance.…”

Section: Review Of Natural Language Video Localization (mentioning)

confidence: 99%
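
The statement above mentions methods that predict, for each frame, a probability of being the start and the end boundary. As an illustration only, one common way to turn such per-frame probabilities into a moment is to pick the pair with the highest joint score subject to start <= end. The function name `best_span` and the product scoring rule are assumptions made for this sketch, not details taken from the cited papers.

```python
def best_span(start_probs, end_probs):
    """Return (start, end) maximizing start_probs[s] * end_probs[e] with s <= e.

    Hypothetical helper: the product rule is an assumed decoding strategy.
    """
    best = (0, 0)
    best_score = -1.0
    best_start = 0  # index of the highest start probability seen so far
    for e, p_end in enumerate(end_probs):
        # Any start at or before e is valid, so track the best one so far.
        if start_probs[e] > start_probs[best_start]:
            best_start = e
        score = start_probs[best_start] * p_end
        if score > best_score:
            best_score = score
            best = (best_start, e)
    return best


if __name__ == "__main__":
    print(best_span([0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.2, 0.6]))
```

The single pass keeps the search linear in the number of frames, which matters for long untrimmed videos.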

“…For prediction, the common practice is to fuse the linguistic feature with the clip's feature to get a fused representation which is then used for predicting an alignment score indicating the matching degree and the fine-tuning offset to the actual boundaries. On the other hand, anchor-free methods like [1, 46-50] model the video as a continuous sequence holistically and predict the specific start and end boundaries in the sequence. As for the detailed operations of prediction, [46,47] first perform a binary classification for each frame in the sequence, and then densely regress the distances from boundaries for all positive frames.…”

Section: Introduction (mentioning)

confidence: 99%

“…However, compared with the classification, it is difficult to regress the accurate boundary values directly with the local representations. Therefore, recently [1, 48-50] directly predict three probability scores for each frame being the foreground annotation, the start, and end boundaries with two kinds of classification loss. We also model the localization as a classification task, but the difference from previous jobs is that we only conduct the binary classification of every atomic representation in the sequence being the foreground or background and then localize the start and end boundaries in the obtained score vector through some postprocessing methods.…”

Section: Introduction (mentioning)

confidence: 99%
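
The statement above describes classifying each atomic unit as foreground or background and then reading the boundaries off the score vector with "some postprocessing methods", which are not specified in the excerpt. A minimal sketch of one plausible post-processing step, assuming a fixed threshold and a longest-run rule (both are assumptions for illustration, as is the name `localize_from_scores`):

```python
def localize_from_scores(scores, threshold=0.5):
    """Return inclusive (start, end) indices of the longest run of
    scores >= threshold, or None if no score reaches the threshold.

    Hypothetical helper: threshold and longest-run rule are assumed.
    """
    best = None
    run_start = None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, s in enumerate(list(scores) + [float("-inf")]):
        if s >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # Run ended at i - 1; keep it if strictly longer than the best so far.
            if best is None or (i - run_start) > (best[1] - best[0] + 1):
                best = (run_start, i - 1)
            run_start = None
    return best


if __name__ == "__main__":
    print(localize_from_scores([0.1, 0.8, 0.9, 0.2, 0.7, 0.6, 0.9, 0.1]))
```

Other choices, such as smoothing the scores first or merging runs separated by short gaps, would fit the same description equally well.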
