“…Nevertheless, pure text embeddings perform consistently best for training classes [30,31,32,33,35] in object detection. The projection from visual to semantic space is done by a linear layer [30,37,35], a single [31,34,33] or two-layer MLP [32], and learned with max-margin losses [30,31,38,37], a softplus-margin focal loss [35], or a cross-entropy loss [33,32]. Zhang et.…”
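To make the described projection concrete, below is a minimal sketch (assuming PyTorch) of the two-layer MLP variant: region features are projected into the text-embedding space and scored by similarity against fixed class text embeddings, trained here with a cross-entropy loss, which is only one of the loss choices listed above. All names, dimensions, and the temperature value are illustrative assumptions, not taken from any of the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualToSemanticHead(nn.Module):
    """Projects visual region features into the text-embedding (semantic) space."""
    def __init__(self, visual_dim=1024, text_dim=512, hidden_dim=1024):
        super().__init__()
        # Two-layer MLP projection; a single nn.Linear would give the linear-layer variant.
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, text_dim),
        )

    def forward(self, region_feats, class_text_embeds):
        # region_feats: (N, visual_dim) features of detected regions
        # class_text_embeds: (C, text_dim) frozen text embeddings of class names
        v = F.normalize(self.proj(region_feats), dim=-1)
        t = F.normalize(class_text_embeds, dim=-1)
        return v @ t.t()  # (N, C) cosine-similarity logits

# Training step with cross-entropy over the similarity logits (illustrative values).
head = VisualToSemanticHead()
feats = torch.randn(8, 1024)           # dummy region features
text_embeds = torch.randn(20, 512)     # dummy class text embeddings
labels = torch.randint(0, 20, (8,))    # ground-truth class indices
logits = head(feats, text_embeds)
loss = F.cross_entropy(logits / 0.07, labels)  # temperature-scaled cross-entropy
loss.backward()
```

A max-margin or softplus-margin focal loss, as cited above, would replace only the final loss computation; the projection head itself stays the same.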