Few-Shot Video Object Detection

Fan, Qi; Tang, Chi-Keung; Tai, Yu-Wing

doi:10.48550/arxiv.2104.14805

Cited by 2 publications

(3 citation statements)

References 143 publications

(191 reference statements)

“…(1) FS setting: Even with 1-way support sets, FS-TAD methods (FS-Trans [51], QAT [31]) still outperform 5-way object detection based counterparts (Feat-RW [20], Meta-DETR [54], FSVOD [9]). This indicates the importance of modeling temporal dynamics and task specific design.…”

Section: Results (mentioning)

confidence: 99%

“…We replaced their backbones with CLIP ViT encoders and the object decoders with TAD decoders. We similarly adapted a video based object detection method (FSVOD [9]) where temporal action proposals and temporal matching network are applied with TAD decoder. For fair comparison, we deploy MUPPET in the FS setting by discarding the textual input.…”

Section: Comparison With State-of-the-art (mentioning)

confidence: 99%

Zero-Shot Temporal Action Detection via Vision-Language Prompting

Nag

Zhu

Song

et al. 2022

Lecture Notes in Computer Science

Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples by exploiting a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD by leveraging few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. This is enabled by efficiently bridging pretrained vision and language models whilst maximally reusing already learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a vision-language model using a meta-learned adapter-equipped visual semantics tokenizer. To tackle large intra-class variation, we further design a query feature regulation scheme. Extensive experiments on ActivityNet v1.3 and THUMOS14 demonstrate that our MUPPET outperforms state-of-the-art alternative methods, often by a large margin. We also show that our MUPPET can be easily extended to tackle the few-shot object detection problem and again achieves state-of-the-art performance on the MS-COCO dataset. The code will be available at https://github.com/sauradip/MUPPET

“…The last is the metric-based approach [1,17,35,40,46,47], which applies a siamese network [40] on support-query pairs to learn a general metric for evaluating their relevance. Our work, including many few-shot works [23,37,90,80,22] on various high-level computer vision tasks, are inspired by the metric-based approach.…”

Section: Related Work (mentioning)

confidence: 99%

Self-Support Few-Shot Semantic Segmentation

Fan

Pei

Tai

et al. 2022

Preprint

Self Cite

Existing few-shot segmentation methods have achieved great progress based on the support-query matching framework. But they still suffer heavily from the limited coverage of intra-class variations in the few-shot supports provided. Motivated by the simple Gestalt principle that pixels belonging to the same object are more similar than pixels belonging to different objects of the same class, we propose a novel self-support matching strategy to alleviate this problem, which uses query prototypes to match query features, where the query prototypes are collected from high-confidence query predictions. This strategy can effectively capture the consistent underlying characteristics of the query objects, and thus fittingly match query features. We also propose an adaptive self-support background prototype generation module and self-support loss to further facilitate the self-support matching procedure. Our self-support network substantially improves the prototype quality, gains more from stronger backbones and more supports, and achieves SOTA on multiple datasets. Code is available at https://github.com/fanq15/SSP.
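
The self-support idea in the abstract above (build a prototype from the query's own high-confidence predictions, then match the query features against it) can be illustrated with a toy sketch. This is not the authors' implementation: it handles only a foreground prototype, omits the background prototype module and the self-support loss, and uses plain cosine similarity on hand-made 2-D features. The function names and the `confidence` threshold are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def self_support_scores(query_feats, support_proto, confidence=0.9):
    """Re-score query features against a prototype built from the query itself.

    `confidence` is a hypothetical threshold chosen for illustration.
    Returns (initial scores, refined scores).
    """
    # Step 1: initial matching of each query feature against the support prototype.
    initial = [cosine(f, support_proto) for f in query_feats]
    # Step 2: keep only the query features matched with high confidence.
    confident = [f for f, s in zip(query_feats, initial) if s > confidence]
    # Fall back to the support prototype when no query feature is confident enough.
    query_proto = mean_vector(confident) if confident else list(support_proto)
    # Step 3: match every query feature against the query-derived prototype.
    refined = [cosine(f, query_proto) for f in query_feats]
    return initial, refined


# Toy 2-D "pixel" features: one clear foreground pixel, one ambiguous, one background.
query_feats = [[0.9, 0.1], [0.6, 0.5], [0.0, 1.0]]
initial, refined = self_support_scores(query_feats, support_proto=[1.0, 0.0])
for before, after in zip(initial, refined):
    print(f"{before:.3f} -> {after:.3f}")
```

In this toy input, only the first feature passes the threshold, so the query prototype is that feature itself. The ambiguous second feature then scores higher against the query-derived prototype than it did against the support prototype, which is the effect the abstract attributes to self-support matching.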
