2021
DOI: 10.48550/arxiv.2104.14805
Preprint
Few-Shot Video Object Detection

Abstract: We introduce Few-Shot Video Object Detection (FSVOD) with three important contributions: 1) a large-scale video dataset, FSVOD-500, comprising 500 classes with class-balanced videos in each category for few-shot learning; 2) a novel Tube Proposal Network (TPN) to generate high-quality video tube proposals that aggregate feature representations for the target video object; 3) a strategically improved Temporal Matching Network (TMN+) to match representative query tube features and supports with better discriminative…
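The pipeline sketched in the abstract — tube proposals whose per-frame features are aggregated into a single tube representation, then matched against few-shot support features — can be illustrated roughly as follows. This is a simplified sketch, not the paper's implementation: the function names, mean-pooling aggregation, class-prototype averaging, and cosine matching are all illustrative assumptions standing in for the TPN and TMN+ components.

```python
import numpy as np

def aggregate_tube_feature(frame_feats):
    """Aggregate per-frame features of one tube proposal into a single
    vector. Mean pooling is an illustrative simplification; the paper's
    tube-level aggregation is more sophisticated."""
    return np.mean(frame_feats, axis=0)

def cosine_sim(a, b):
    """Cosine similarity with a small epsilon for numerical safety."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def classify_tubes(tube_feats, support_feats):
    """Match each aggregated query tube feature against class prototypes
    (here simply the mean of each class's support features).

    tube_feats:    list of 1-D feature vectors, one per tube proposal
    support_feats: dict mapping class id -> 2-D array of support features
    Returns a list of (best_class, similarity_score) per tube."""
    protos = {c: np.mean(f, axis=0) for c, f in support_feats.items()}
    results = []
    for t in tube_feats:
        scores = {c: cosine_sim(t, p) for c, p in protos.items()}
        best = max(scores, key=scores.get)
        results.append((best, scores[best]))
    return results
```

In a real few-shot detector the matching step would be a learned network rather than raw cosine similarity, but the overall flow — propose tubes, aggregate over time, compare to supports — is the same.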

Cited by 2 publications (3 citation statements)
References 143 publications (191 reference statements)
“…(1) FS setting: Even with 1-way support sets, FS-TAD methods (FS-Trans [51], QAT [31]) still outperform 5-way object detection based counterparts (Feat-RW [20], Meta-DETR [54], FSVOD [9]). This indicates the importance of modeling temporal dynamics and task-specific design.…”
Section: Results
confidence: 99%
“…We replaced their backbones with CLIP ViT encoders and the object decoders with TAD decoders. We similarly adapted a video-based object detection method (FSVOD [9]), where temporal action proposals and a temporal matching network are applied with the TAD decoder. For a fair comparison, we deploy MUPPET in the FS setting by discarding the textual input.…”
Section: Comparison With State-of-the-art
confidence: 99%
“…The last is the metric-based approach [1,17,35,40,46,47], which applies a siamese network [40] on support-query pairs to learn a general metric for evaluating their relevance. Our work, like many few-shot works [23,37,90,80,22] on various high-level computer vision tasks, is inspired by the metric-based approach.…”
Section: Related Work
confidence: 99%
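The metric-based approach described in the last citation — a siamese network applied to support-query pairs to learn a general relevance metric — hinges on weight sharing: the same embedding function processes both inputs, so the learned metric generalizes to unseen classes. A minimal numpy sketch of that idea (the fixed random projection, tanh nonlinearity, and negative-distance score are illustrative assumptions, not any cited paper's architecture):

```python
import numpy as np

class SiameseMetric:
    """Toy siamese scorer: one shared projection embeds both the support
    and the query (weight sharing is the essence of the siamese design),
    then a distance in the embedded space gives their relevance."""

    def __init__(self, in_dim, emb_dim, seed=0):
        rng = np.random.default_rng(seed)
        # A single weight matrix, used for BOTH inputs. In a trained
        # model this would be learned; here it is random for illustration.
        self.W = rng.standard_normal((in_dim, emb_dim)) / np.sqrt(in_dim)

    def embed(self, x):
        return np.tanh(x @ self.W)

    def relevance(self, support, query):
        # Negative Euclidean distance: higher means more similar.
        s, q = self.embed(support), self.embed(query)
        return -float(np.linalg.norm(s - q))
```

Because the metric, not the classes, is what gets learned, the same scorer can rank support-query pairs for categories never seen during training — which is why this family of methods suits the few-shot setting.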