Find and Focus: Retrieve and Localize Video Events with Natural Language Queries

Shao, Dian; Xiong, Yingfei; Zhao, Yue; Huang, Qingqiu; Qiao, Yu; Lin, Dahua

doi:10.1007/978-3-030-01240-3_13

Cited by 86 publications

(36 citation statements)

References 35 publications

“…1. It has also drawn great attention from industry due to its various applications such as video question answering (Lei et al., 2018), video content retrieval (Shao et al., 2018), and human-computer interaction, etc.…”

Section: Introduction (mentioning)

confidence: 99%

Relation-aware Video Reading Comprehension for Temporal Language Grounding

Gao, Xin, Xu, et al. 2021

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Temporal language grounding in videos aims to localize the temporal span relevant to the given query sentence. Previous methods treat it either as a boundary regression task or a span extraction task. This paper will formulate temporal language grounding into video reading comprehension and propose a Relation-aware Network (RaNet) to address it. This framework aims to select a video moment choice from the predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match the visual and textual information simultaneously in sentence-moment and token-moment levels, leading to a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor is introduced by leveraging graph convolution to capture the dependencies among video moment choices for the best choice selection. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. Codes will be available at https://github.com/Huntersxsx/RaNet.

“…Considering the naive proposal method provides no flexible window size, it cannot locate moment more accurately. In view of this, Shao et al [37] proposed to use the correlation between each clip and sentence to select candidate windows. The Query-guided Segment Proposal Network (QSPN) proposed by Xu et al [38] is also along the same lines, and they also added video captioning as a secondary auxiliary to help training.…”

Section: A Supervised Methods (mentioning)

confidence: 99%

A Survey on Natural Language Video Localization

Liu, Nie, Tan, et al. 2021

Preprint

Natural language video localization (NLVL), which aims to locate a target moment from a video that semantically corresponds to a text query, is a novel and challenging task. Toward this end, in this paper, we present a comprehensive survey of the NLVL algorithms, where we first propose the pipeline of NLVL, and then categorize them into supervised and weakly-supervised methods, following by the analysis of the strengths and weaknesses of each kind of methods. Subsequently, we present the dataset, evaluation protocols and the general performance analysis. Finally, the possible perspectives are obtained by summarizing the existing methods.

“…Current mainstream trackers [10,11,12,129,130,131,132,133,134,135] adopt tracking-by-detection (TBD) by first performing per-frame detection and then associating the detected boxes in the temporal dimension. Current works [13,136,137,138,139] leverage trajectories or tubes to capture motion trails of targets. MOT variants include e.g., video object segmentation (VOS) [14,15], video instance segmentation (VIS) [16], multi-object tracking and segmentation (MOTS) [140] and video panoptic segmentation (VPS) [141].…”

Section: Object Tracking (mentioning)

confidence: 99%

Few-Shot Video Object Detection

Fan, Tang, Tai 2021

Preprint

We introduce Few-Shot Video Object Detection (FSVOD) with three important contributions: 1) a large-scale video dataset FSVOD-500 comprising of 500 classes with class-balanced videos in each category for few-shot learning; 2) a novel Tube Proposal Network (TPN) to generate high-quality video tube proposals to aggregate feature representation for the target video object; 3) a strategically improved Temporal Matching Network (TMN+) to match representative query tube features and supports with better discriminative ability. Our TPN and TMN+ are jointly and end-to-end trained. Extensive experiments demonstrate that our method produces significantly better detection results on two few-shot video object detection datasets compared to image-based methods and other naive video-based extensions. Codes and datasets will be released at https://github.com/fanq15/FewX.
