2018
DOI: 10.48550/arxiv.1811.10080
|View full text |Cite
Preprint
|
Sign up to set email alerts
|

Learning to discover and localize visual objects with open vocabulary

Keren Ye,
Mingda Zhang,
Wei Li
et al.

Abstract: To alleviate the cost of obtaining accurate bounding boxes for training today's state-of-the-art object detection models, recent weakly supervised detection work has proposed techniques to learn from image-level labels. However, requiring discrete image-level labels is both restrictive and suboptimal. Real-world "supervision" usually consists of more unstructured text, such as captions. In this work we learn association maps between images and captions. We then use a novel objectness criterion to rank the resu… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
1

Citation Types

0
3
0

Year Published

2019
2019
2021
2021

Publication Types

Select...
3

Relationship

0
3

Authors

Journals

citations
Cited by 3 publications
(3 citation statements)
references
References 41 publications
0
3
0
Order By: Relevance
“…It is similar to a closed vocabulary, yet the rich semantic content of captions is discarded. In contrast, research in [123] and [158] aims to discover an open set of object classes from image-caption corpora, and learns detectors for each discovered class.…”
Section: B) Weak Supervisionmentioning
confidence: 99%
“…It is similar to a closed vocabulary, yet the rich semantic content of captions is discarded. In contrast, research in [123] and [158] aims to discover an open set of object classes from image-caption corpora, and learns detectors for each discovered class.…”
Section: B) Weak Supervisionmentioning
confidence: 99%
“…Amrani et al [2] train a WSD model based on the presence of a predefined set of words in captions, which is similarly closed-vocabulary, and discards the rich semantic content of captions, which we exploit through transformers. In contrast, Sun et al [35] and Ye et al [44] aim to discover an open set of object classes from image-caption corpora, and learn detectors for each discovered class. A key limitation of all such WSD methods is their inferior object localization accuracy.…”
Section: Related Workmentioning
confidence: 99%
“…[27] extracts names and actions from captions and associates them to the faces and body poses of people shown in the images, to train face and pose models. In a similar spirit, [42] proposes a method to detect objects from captions. While these methods directly use the raw, noisy data, we propose a method to clean and homogenize annotations into a natural vocabulary.…”
Section: Related Workmentioning
confidence: 99%