2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017
DOI: 10.1109/cvpr.2017.124

Discover and Learn New Objects from Documentaries

Abstract: Despite the remarkable progress in recent years, detecting objects in a new context remains a challenging task. Detectors learned from a public dataset can only work with a fixed list of categories, while training from scratch usually requires a large amount of training data with detailed annotations. This work aims to explore a novel approach: learning object detectors from documentary films in a weakly supervised manner. This is inspired by the observation that documentaries often provide dedicated expositio…

Cited by 23 publications (15 citation statements)
References: 41 publications
“…In contrast, in this work no manually annotated visual data is involved at any stage of our approach. To avoid labelling visual data, several approaches have leveraged audio transcripts obtained from narrated videos using automatic speech recognition (ASR) as a way to supervise video models for object detection [3,15,54], captioning [33,69], classification [2,42,47,86], summarization [57] or retrieval [50] using large-scale narrated video datasets such as How2 [65] or HowTo100M [50]. Others [10,30] have investigated learning from narrated videos by directly using the raw speech waveform instead of generating transcriptions.…”
Section: Related Work (mentioning)
confidence: 99%
“…Multiple instance learning (MIL) [7,33] methods have been used for learning weakly supervised tasks such as object localization (WSOL) [25,8,53,41]. In a standard MIL framework, instance labels in each positive bag are treated as hidden variables with the constraint that at least one of them should be positive.…”
Section: Related Work (mentioning)
confidence: 99%
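
The MIL constraint described in the statement above can be illustrated with a minimal sketch. It is not taken from the cited papers: the function names are hypothetical, the rule that a negative bag contains only negative instances is the usual MIL convention assumed here, and max pooling is just one common way to turn instance scores into a bag score.

```python
from typing import List


def satisfies_mil_constraint(bag_is_positive: bool, instance_labels: List[int]) -> bool:
    """Check a candidate assignment of hidden instance labels (0 or 1) against a bag label.

    Positive bag: at least one instance must be positive (as stated in the text).
    Negative bag: every instance must be negative (standard MIL convention, assumed here).
    """
    if bag_is_positive:
        return any(label == 1 for label in instance_labels)
    return all(label == 0 for label in instance_labels)


def bag_score(instance_scores: List[float]) -> float:
    """Aggregate instance scores into a bag score with max pooling.

    Max pooling is an assumption for illustration: the bag is scored by its most
    confident instance, mirroring the "at least one positive" constraint.
    """
    if not instance_scores:
        raise ValueError("a bag must contain at least one instance")
    return max(instance_scores)


if __name__ == "__main__":
    # A positive bag (e.g. an image known to contain the object) with one positive region.
    print(satisfies_mil_constraint(True, [0, 0, 1, 0]))
    # The same bag with no positive region violates the constraint.
    print(satisfies_mil_constraint(True, [0, 0, 0, 0]))
    # Bag score driven by the highest-scoring instance.
    print(bag_score([0.1, 0.7, 0.3]))
```

In weakly supervised localization, a bag typically corresponds to an image or video clip and the instances to candidate regions, so only the bag-level label needs to be supplied.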
“…[3,13] focus on separating distinguishable audio and video objects simultaneously. [6] learn to associate tracklets with words in documentary subtitles. Most of these multi-modal methods primarily focus on captioning or retrieval tasks, while our main focus is localization.…”
Section: Related Work (mentioning)
confidence: 99%
“…We believe an essential step to scale up to millions of object classes is to use abundant and labor-free web data. One pioneering work is from Chen et al [6] which learns to discover and localize new objects from documentary videos by associating subtitles to video tracklets. There is also work to associate phrases in the caption to its visually depicted objects in the image [33,20].…”
Section: Introduction (mentioning)
confidence: 99%