2023
DOI: 10.36227/techrxiv.22082723
Preprint

Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation

Abstract: Open Vocabulary Instance Segmentation

Cited by 4 publications
(4 citation statements)
“…It also proposes to use cached web data to enhance the novel classes during training. Based on the DETR [4] framework, CORA [219] proposes region prompting and anchor pre-matching. The former reduces the gap between the whole image and region distributions by prompting the region features of the CLIP-based region classifier, while the latter learns generalizable object localization via a class-aware matching mechanism.…”
Section: Open Vocabulary Object Detection (mentioning)
confidence: 99%
“…Rahman et al. [32] further advanced ZSD with an enhanced visual-semantic alignment technique, employing a polarity loss function to improve discrimination between positive and negative predictions significantly. More recently, the advent of vision-language pretraining has led to ZSD being conceptualized as an image-text matching problem [33]-[35], leveraging large-scale image-text data to expand the number of training classes. Inspired by these methods, this research utilizes a pre-trained vision-language model for unseen pill detection.…”
Section: B Zero-shot Object Detection (mentioning)
confidence: 99%
“…This setting hampers the broader applicability of SGG models in diverse real-world applications. Influenced by the achievements in open vocabulary object detection [7,14,36,41,47], recent works [9,45] attempt to extend the SGG task from closed-set to open vocabulary domain. However, they focus on an object-centric open vocabulary setting, which only considers the scene graph nodes.…”
Section: Introduction (mentioning)
confidence: 99%