2022
DOI: 10.1007/978-3-031-16788-1_24

Localized Vision-Language Matching for Open-vocabulary Object Detection

Cited by 12 publications (3 citation statements)
References 29 publications
“…VL-PLM [215] proposes to train Faster R-CNN as a two-stage class-agnostic proposal generator using a detection dataset without category information. LocOV [238] uses class-agnostic proposals in the RPN to train Faster R-CNN by matching region features and word embeddings from the image and caption, respectively. From a data-generation perspective, several works [138], [239], [240] adopt diffusion models to generate on-target data for effective training.…”
Section: Open Vocabulary Object Detection
confidence: 99%
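The region-to-word matching that the excerpt credits to LocOV can be pictured with a minimal sketch, assuming the method's general shape rather than its released code: class-agnostic region features from an RPN are compared against caption word embeddings, and each word is grounded to its best-matching region. The tensor shapes, the max-over-regions pooling, and the averaging over words are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (not the authors' code) of region-word matching:
# class-agnostic proposal features are scored against caption word
# embeddings; each word picks its best region, giving an image-caption
# alignment score usable in a contrastive loss over a batch.
import torch
import torch.nn.functional as F

def region_word_alignment(region_feats: torch.Tensor,
                          word_embeds: torch.Tensor) -> torch.Tensor:
    """region_feats: (R, D) proposal features; word_embeds: (W, D) caption
    word embeddings projected into the same D-dimensional space."""
    region_feats = F.normalize(region_feats, dim=-1)
    word_embeds = F.normalize(word_embeds, dim=-1)
    sim = region_feats @ word_embeds.t()      # (R, W) cosine similarities
    # Ground each caption word to its best-matching region, then average
    # over words to get a scalar image-caption alignment score.
    return sim.max(dim=0).values.mean()

# Toy usage with random features (D = 256 is an arbitrary choice).
score = region_word_alignment(torch.randn(100, 256), torch.randn(12, 256))
```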
“…We utilize the OVAD dataset [41] and the VAW dataset [42] to obtain image-text pairs associated with attribute knowledge. The OVAD dataset consists of 80 object classes and 117 attribute classes.…”
Section: Datasets and Evaluation Metrics
confidence: 99%
“…With the rise of image-text pre-trained multimodal models such as CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021), and ALBEF, researchers have put effort into combining cross-modal knowledge with dense prediction tasks like detection (Gu et al., 2021; Bravo et al., 2023) and segmentation (Kim et al., 2023; Liang et al., 2022; Rao et al., 2021; Xu et al., 2023d; Qin et al., 2023; Liang et al., 2023; Xu et al., 2023a,b). However, the text used in these works is restricted to object class words or attributes (Li et al., 2022a).…”
Section: Zero-shot Segmentation
confidence: 99%
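The common recipe behind the works this excerpt surveys can be sketched as follows, under the assumption that CLIP text embeddings of arbitrary class names replace a fixed linear classification head; the prompt template and the random region features are placeholders, not any specific paper's pipeline.

```python
# Hedged sketch: CLIP text embeddings of class names act as open-vocabulary
# classifier weights for region (or pixel) features. Uses OpenAI's `clip`
# package (pip install git+https://github.com/openai/CLIP).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "traffic light"]  # extendable at test time
tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_embeds = model.encode_text(tokens).float()
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

# Placeholder region features, assumed already projected to CLIP's
# embedding size; classification is cosine similarity against the text.
region_feats = torch.randn(5, text_embeds.shape[-1], device=device)
region_feats = region_feats / region_feats.norm(dim=-1, keepdim=True)
logits = region_feats @ text_embeds.t()        # (5, num_classes)
predicted = [class_names[i] for i in logits.argmax(dim=-1).tolist()]
```

Because the classifier is just a set of text embeddings, the vocabulary can be swapped or extended at inference time without retraining, which is what makes the approach "open-vocabulary".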