2021
DOI: 10.1109/tip.2021.3120038
Text-Based Localization of Moments in a Video Corpus

Cited by 15 publications (3 citation statements)
References 59 publications
“…Many follow-ups further boost the zero-shot transferable ability, e.g., CoOp [57], CLIP-Adapter [14], and Tip-adapter [55]. In video domains, similar idea has also been explored for transferable representation learning [26], and text based action localization [32]. CLIP is used recently in action recognition [43] and TAD [19,30].…”
Section: Related Work
confidence: 99%
“…Since then, many follow-ups have been proposed, including improved training strategy (e.g., CoOp [54], CLIP-Adapter [12], Tip-adapter [50]). In video domains, similar idea has also been explored for transferable representation learning [24], text based action localization [32]. CLIP has also been used very recently in action recognition (e.g., ActionCLIP [41]) and TAD [17].…”
Section: Related Work
confidence: 99%
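The zero-shot transfer idea these statements attribute to CLIP-based localization work can be made concrete with a small sketch. The following is a hypothetical illustration of scoring video frames against a text query with the public OpenAI `clip` package, not the method of any cited paper; the frame list, window size, and sliding-window scoring are assumptions for illustration.

```python
# Hypothetical sketch: zero-shot text-based moment scoring with CLIP.
# The sliding-window localization below is an illustrative assumption,
# not the procedure of the cited papers.
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def frame_scores(frames, query):
    """Cosine similarity between each sampled frame (PIL Image) and the query."""
    images = torch.stack([preprocess(f) for f in frames]).to(device)
    tokens = clip.tokenize([query]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(images)
        txt_feat = model.encode_text(tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).squeeze(-1)  # (num_frames,)

def best_moment(scores, window=8):
    """Slide a fixed window over frame scores; return (start, end) frame indices."""
    sums = scores.unfold(0, window, 1).mean(dim=-1)  # mean score per window
    start = int(sums.argmax())
    return start, start + window
```

Given a list of sampled frames and a sentence query, `best_moment(frame_scores(frames, query))` returns the highest-scoring temporal window; more elaborate proposal generation is what the cited localization and TAD methods add on top of this kind of similarity signal.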
“…A few recent methods [38, 199, 215, 252-254] tackle the VCMR problem. Zhang et al. [199] develop a hierarchical multi-modal encoder to learn multimodal interactions at both coarse- and fine-grained granularities.…”
Section: Video Corpus Moment Retrieval
confidence: 99%
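To make the coarse- plus fine-grained matching idea in this statement concrete, here is a minimal PyTorch sketch that fuses a video-level (coarse) similarity with clip-level (fine) similarities. The layer sizes, the softmax attention over clips, and the additive fusion are assumptions for illustration, not the hierarchical encoder of Zhang et al. [199].

```python
# Minimal sketch of coarse- plus fine-grained video-text matching.
# All design choices here (projection layers, attention, additive fusion)
# are illustrative assumptions, not the cited architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseFineMatcher(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.clip_proj = nn.Linear(dim, dim)   # projects per-clip features
        self.query_proj = nn.Linear(dim, dim)  # projects the sentence feature

    def forward(self, clip_feats, query_feat):
        """
        clip_feats: (num_clips, dim) features for the clips of one video
        query_feat: (dim,) sentence-level feature for the text query
        Returns a scalar video-query relevance score.
        """
        q = F.normalize(self.query_proj(query_feat), dim=-1)
        c = F.normalize(self.clip_proj(clip_feats), dim=-1)
        fine = c @ q  # (num_clips,) per-clip (fine-grained) similarity
        # Video-level (coarse) similarity from mean-pooled clip features.
        coarse = F.normalize(clip_feats.mean(dim=0), dim=-1) @ F.normalize(query_feat, dim=-1)
        # Attend over clips with the fine scores, then fuse both granularities.
        attn = fine.softmax(dim=0)
        return coarse + (attn * fine).sum()
```

Ranking every video in a corpus by this fused score, then localizing the moment inside the top-ranked videos, is the two-stage pattern VCMR methods like those cited follow.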