Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos

Zhang, Zhu; Lin, Zhijie; Zhao, Zhou; Zhu, Jieming; He, Xiuqiang

doi:10.1145/3394171.3413967

Cited by 61 publications

(27 citation statements)

References 42 publications

(96 reference statements)


“…This is formally named as weakly supervised TSGV. The typical methods include WSDEC [14], TGA [43], WSLLN [17], SCN [34], Chen et al [12], VLANet [40], MARN [54], BAR [64], RTBPN [85], CCL [86], EC-SL [11], LoGAN [55] and CRM [26]. In general, weakly supervised methods for TSGV can be grouped into two categories (i.e., MIL-based and reconstruction-based).…”

Section: Weakly Supervised Methods (mentioning)

confidence: 99%

“…[86] design a counterfactual contrastive learning paradigm to improve the visual-and-language grounding tasks. A regularized two-branch proposal network (RTBPN) [85] is also presented to explore sufficient intra-sample confrontment with sharable two-branch proposal module for distinguishing the target moment from plausible negative moments.…”

Section: Weakly Supervised Methods (mentioning)

confidence: 99%


A Survey on Temporal Sentence Grounding in Videos

Lan, Yuan, Wang et al. 2021

Preprint


Temporal sentence grounding in videos (TSGV), which aims to localize one target segment from an untrimmed video with respect to a given sentence query, has drawn increasing attention in the research community over the past few years. Unlike temporal action localization, TSGV is more flexible since it can locate complicated activities via natural language, without restrictions from predefined action categories. Meanwhile, TSGV is more challenging since it requires both textual and visual understanding for semantic alignment between two modalities (i.e., text and video). In this survey, we give a comprehensive overview of TSGV, which i) summarizes the taxonomy of existing methods, ii) provides a detailed description of the evaluation protocols (i.e., datasets and metrics) used in TSGV, and iii) discusses in depth the potential problems of current benchmarking designs and research directions for further investigation. To the best of our knowledge, this is the first systematic survey on temporal sentence grounding. More specifically, we first discuss existing TSGV approaches by grouping them into four categories, i.e., two-stage methods, end-to-end methods, reinforcement learning-based methods, and weakly supervised methods. Then we present the benchmark datasets and evaluation metrics used to assess current research progress. Finally, we discuss some limitations in TSGV by pointing out potential problems improperly resolved in the current evaluation protocols, which may push forward more cutting-edge research in TSGV. We also share our insights on several promising directions, including three typical tasks with new and practical settings based on TSGV.


“…Weakly-supervised temporal video grounding. To ease the human labelling efforts, several works (Bojanowski et al 2015; Mithun, Paul, and Roy-Chowdhury 2019; Lin et al 2020; Song et al 2020; Zhang et al 2020b; Ma et al 2020; Tan et al 2021) consider a weakly-supervised setting which only access the information of matched video-query pairs without accurate segment boundaries. (Mithun, Paul, and Roy-Chowdhury 2019) utilize the dependency between video and sentence as the supervision while abandon the temporal ordered information.…”

Section: Language-based Semantic Mining (mentioning)

confidence: 99%

Unsupervised Temporal Video Grounding with Deep Semantic Clustering

Daizong, Qu, Wang et al. 2022

Preprint


Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query. Though respectable works have made decent achievements in this task, they severely rely on abundant video-query paired data, which is expensive and time-consuming to collect in real-world scenarios. In this paper, we explore whether a video grounding model can be learned without any paired annotations. To the best of our knowledge, this paper is the first work trying to address TVG in an unsupervised setting. Considering there is no paired supervision, we propose a novel Deep Semantic Clustering Network (DSCNet) to leverage all semantic information from the whole query set to compose the possible activity in each video for grounding. Specifically, we first develop a language semantic mining module, which extracts implicit semantic features from the whole query set. Then, these language semantic features serve as the guidance to compose the activity in video via a video-based semantic aggregation module. Finally, we utilize a foreground attention branch to filter out the redundant background activities and refine the grounding results. To validate the effectiveness of our DSCNet, we conduct experiments on both ActivityNet Captions and Charades-STA datasets. The results demonstrate that DSCNet achieves competitive performance, and even outperforms most weakly-supervised approaches.


“…the video-sentence pairs without the temporal labels (i.e., start and end time). Zhang et al [33] developed a shareable two-branch framework that simultaneously took the inter- and intra-sample confrontation into account.…”

Section: A Temporal Sentence Grounding (mentioning)

confidence: 99%

“…• RTBPN [33]: The RTBPN method devises a shareable two-branch proposal framework to consider both the inter- and intra-sample confrontation.…”

Section: Performance Comparisons (mentioning)

confidence: 99%

Coarse-to-Fine Spatial-Temporal Relationship Inference for Temporal Sentence Grounding

Yang et al. 2021

IEEE Access


Temporal sentence grounding aims to ground a query sentence into a specific segment of the video. Previous methods follow the common equally-spaced frame selection mechanism for appearance and motion modeling, which fails to consider redundant and distracting visual information. There is also no guarantee that all meaningful frames can be obtained. Moreover, this task needs to detect the location clues precisely from both spatial and temporal dimensions, but the relationship between spatial-temporal semantic information and the query sentence is still unexplored in existing methods. Inspired by human thinking patterns, we propose a Coarse-to-Fine Spatial-Temporal Relationship Inference (CFSTRI) network to progressively localize fine-grained activity segments. Firstly, we present a coarse-grained crucial frame selection module, where the query-guided local difference context modeling from adjacent frames helps discriminate all the coarse boundary locations relevant to the sentence semantics, and the soft assignment vector of locally aggregated descriptors is employed to enhance the representation of selected frames. Then, we develop a fine-grained spatial-temporal relationship matching module to refine the coarse boundaries, which disentangles the spatial and temporal semantic information from the query sentence to guide the excavation of visual grounding clues of corresponding dimensions. Furthermore, we devise a gated graph convolution network to incorporate the spatial-temporal semantic information by leveraging a gate operation to highlight frames referred to by the query sentence from spatial and temporal dimensions, and propagate fused information on the graph. Extensive experiments on two benchmark datasets demonstrate that our CFSTRI significantly outperforms most state-of-the-art methods.
