Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling

Huynh, Dat; Kuen, Jason; Lin, Zhe; Gu, Jiuxiang; Elhamifar, Ehsan

doi:10.48550/arxiv.2111.12698

Cited by 1 publication

(3 citation statements)

References 80 publications

(117 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…To learn the semantics of novel classes, recent methods [3,13,16,41,44] have simplified the problem by providing image-caption pairs as a weak supervision signal. Such pairs are cheap to acquire and make the problem tractable.…”

Section: Related Workmentioning

confidence: 99%

“…Most of these methods require big dataset with millions of image-caption pairs to train such a model. They either use this model to align image-regions with captions and generate object-box pseudo labels [16,44] or as region-image feature extractor to classify the regions [13]. Many weakly-supervised [1,3,7,34,43] approaches have been proposed to perform such object grounding.…”

Section: Related Workmentioning

confidence: 99%

“…CLIP (cropped reg) [13] uses the CLIP pre-trained model on 400M image-caption pairs on object proposals obtained by an object detector trained on known classes. XP-Mask [16] first learns a class-agnostic region proposal and segmentation model from the known classes and then uses this model as a teacher to generate pseudo masks for self-training a student model. Finally, we also compare with VILD [13] which uses CLIP soft predictions to distil semantic information and train an object detector.…”

Section: Baselines Ovrmentioning

confidence: 99%

See 2 more Smart Citations

Localized Vision-Language Matching for Open-vocabulary Object Detection

Bravo¹,

Mittal²,

Brox³

2022

Preprint

View full text Add to dashboard Cite

In this work, we propose an open-world object detection method that, based on image-caption pairs, learns to detect novel object classes along with a given set of known classes. It is a two-stage training approach that first uses a location-guided image-caption matching technique to learn class labels for both novel and known classes in a weaklysupervised manner and second specializes the model for the object detection task using known class annotations. We show that a simple language model fits better than a large contextualized language model for detecting novel objects. Moreover, we introduce a consistency-regularization technique to better exploit image-caption pair information. Our method compares favorably to existing open-world detection approaches while being data-efficient.

show abstract

Section: Related Workmentioning

confidence: 99%